IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 

DECLARATION OF VISHWANATH R. IYER, Ph.D.
UNDER 37 C.F.R. § 1.132 

I, VISHWANATH R. IYER, Ph.D., declare and state as follows:

1. I am an Assistant Professor in the Section of 
Molecular Genetics and Microbiology, Institute of Cellular and 
Molecular Biology, University of Texas at Austin, where my 
laboratory currently studies global transcriptional control in 
yeast, gene expression programs during human cell 
proliferation, and genome-wide transcription factor targets in 
yeast and human. Immediately prior to this position, I spent 
four years as a postdoctoral fellow in the laboratory of 
Patrick O. Brown at Stanford University studying the
transcriptional programs of yeast and of human cells. My 
curriculum vitae is attached hereto as Exhibit A. 

2. Beginning in Dr. Brown's laboratory, where I 
helped to develop the first whole genome arrays for yeast and 
early versions of highly representative cDNA arrays for human 
cells, and continuing to the present day, I have used 
microarray-based gene expression analysis as a principal 
approach in much of my research. 

3. Representative publications describing this 
work include: 



DeRisi J. et al., "Exploring the metabolic and
genetic control of gene expression on a genomic
scale," Science 278:680-686 (1997);¹

Marton et al., "Drug target validation and
identification of secondary drug target effects
using DNA microarrays," Nature Med. 4:1293-1301
(1998);²

Iyer et al., "The transcriptional program in
the response of human fibroblasts to serum,"
Science 283:83-87 (1999);³ and

Ross et al., "Systematic variation in gene
expression patterns in human cancer cell lines,"
Nature Genetics 24:227-235 (2000).⁴

Two of the papers describe our use of microarray-based
expression profiling to explore the metabolic reprogramming
that occurs during major physiological changes, both in yeast
(DeRisi et al., during the shift from fermentation to
respiration) and in human cells (Iyer et al., human
fibroblasts exposed to serum). One reference describes our
use of expression profile analysis in drug target validation
and identification of secondary drug effects (Marton et al.).
And one describes our use of expression profiling as a
molecular phenotyping tool to discriminate among human cancer
cells (Ross et al.).

4. Whether used to elucidate basic physiological
responses, to study primary and secondary drug effects, or to
discriminate and classify human cancers, expression profiling
as we have practiced it relies for its power on comparison of
patterns of expression.



5. For example, we have demonstrated that we can
use the presence or absence of a characteristic drug
"signature" pattern of altered gene expression in drug-treated
cells to explore the mechanism of drug action, and to identify
secondary effects that can signal potentially deleterious drug
side effects. As another example, we have demonstrated that
gene expression patterns can be used to classify human tumor
cell lines. While it is of course advantageous to know the
biological function of the encoded gene products in order to
reach a better understanding of the cellular mechanisms
underlying these results, these pattern-based analyses do not
require knowledge of the biological function of the encoded
proteins.

6. The resolution of the patterns used in such
comparisons is determined by the number of genes detected: the 
greater the number of genes detected, the higher the 
resolution of the pattern. It goes without saying that higher 
resolution patterns are generally more useful in such 
comparisons than lower resolution patterns. With such higher 
resolutions comes a correspondingly higher degree of 
statistical confidence for distinguishing different patterns, 
as well as identifying similar ones. 

7. Each gene included as a probe on a microarray
provides a signal that is specific to the cognate transcript,
at least to a first approximation.⁵ Each new gene-specific
probe added to a microarray thus increases the number of genes
detectable by the device, increasing the resolving power of
the device. As I note above, higher resolution patterns are
generally more useful in comparisons than lower resolution
patterns. Accordingly, each new gene probe added to a
microarray increases the usefulness of the device in gene
expression profiling analyses. This proposition is so
well-established as to be virtually an axiom in the art, and has
been as long as I have been working in the field, and
certainly since the time I embarked on the production of whole
genome arrays in early 1996. Simply put, arrays with fewer
gene-specific probes are inferior to arrays with more
gene-specific probes.

⁵ In a more nuanced view, it is certainly possible for a probe to
signal the presence of a variety of splice variants of a single gene,
without discriminating among them, and for a probe to signal the presence
of a variety of allelic variants of a single gene, again without
discriminating among them.

8. For example, our ability to subdivide cancers
into discriminable classes by expression profiling is limited
by the resolution of the patterns produced. With more genes
contributing to the expression patterns, we can potentially
draw finer distinctions among the patterns, thus subdividing
otherwise indistinguishable cancers into a greater number of
classes; the greater the number of classes, the greater the
likelihood that the cancers classified together will respond
similarly to therapeutic intervention, permitting better
individualization of therapy and, we hope, better treatment
outcomes.

9. If a gene does not change expression in an
experiment, or if a gene is not expressed and produces no
signal in an experiment, that is not to say that the probe
lacks usefulness on the array; it only means that an
insufficient number of conditions have been sampled to
identify expression changes. In fact, an experiment showing
that a gene is not expressed or that its expression level does
not change can be equally informative. To provide maximum
versatility as a research tool, the microarray should
include (and as a biologist I would want my microarray to
include) each newly identified gene as a probe.

10. I declare further that all statements made 
herein of my own knowledge are true and that all statements 
made on information and belief are believed to be true, and 
further that these statements were made with the knowledge 
that willful false statements and the like so made are 
punishable by fine or imprisonment, or both, under 
Section 1001 of Title 18 of the United States Code and may 
jeopardize the validity of any patent application in which 
this declaration is filed or any patent that issues thereon. 



October 20, 2003 



VISHWANATH R. IYER, Ph.D.


Program: blastp
Sequence ID(s): 1459372CD1 vs. genpept137

NCBI-BLASTP 2.0.10 [Aug-26-1999]


Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, 
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), 
"Gapped BLAST and PSI-BLAST: a new generation of protein database search 
programs", Nucleic Acids Res. 25:3389-3402. 

Query= 1459372CD1
(269 letters)

Database: genpept137
1,534,369 sequences; 474,463,515 total letters

g15777195  J-domain protein Jiv [Bos taurus]            535  e-151
g15777193  J-domain protein Jiv [Bos taurus]            535  e-151
g26337373  unnamed protein product [Mus musculus]       534  e-150
g12857284  unnamed protein product [Mus musculus]       534  e-150
g15843561  DnaJ1 protein [Bos taurus]                   533  e-150
g15029846  RIKEN cDNA 5730551F12 gene [Mus musculus]    533  e-150
g26349793  unnamed protein product [Mus musculus]       531  e-150
g26337271  unnamed protein product [Mus musculus]       454  e-126


> g16550798 unnamed protein product [Homo sapiens]
Length = 412

Score = 562 bits (1433), Expect = e-159
Identities = 269/269 (100%), Positives = 269/269 (100%)

Query: 1   MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 60
           MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI
Sbjct: 144 MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 203

Query: 61  VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 120
           VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP
Sbjct: 204 VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 263

Query: 121 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 180
           KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH
Sbjct: 264 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 323

Query: 181 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA 240
           RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA
Sbjct: 324 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA 383

Query: 241 AAASKPNSTVPKGEAKPKRRKKVRRPFQR 269
           AAASKPNSTVPKGEAKPKRRKKVRRPFQR
Sbjct: 384 AAASKPNSTVPKGEAKPKRRKKVRRPFQR 412


> g14194055 dopamine receptor interacting protein [Rattus norvegicus]
Length = 701

Score = 536 bits (1365), Expect = e-151
Identities = 252/269 (93%), Positives = 260/269 (95%)

Query: 1   MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 60
           MAGVPEDELNPFHVLGVEATASD+ELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI
Sbjct: 433 MAGVPEDELNPFHVLGVEATASDIELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 492

Query: 61  VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 120
           VSN E+RKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP
Sbjct: 493 VSNPERRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 552

Query: 121 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 180
           KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH
Sbjct: 553 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 612

Query: 181 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA 240
           RVPYHISFGSR+PGT GRQRATP++PPADLQDFLSRIFQVPPG M NGNFFAAP P PG
Sbjct: 613 RVPYHISFGSRVPGTSGRQRATPESPPADLQDFLSRIFQVPPGPMSNGNFFAAPHPGPGT 672

Query: 241 AAASKPNSTVPKGEAKPKRRKKVRRPFQR 269
            + S+PNS+VPKGEAKPKRRKKVRRPFQR
Sbjct: 673 TSTSRPNSSVPKGEAKPKRRKKVRRPFQR 701


> g15777195 J-domain protein Jiv [Bos taurus]
Length = 699

Score = 535 bits (1364), Expect = e-151
Identities = 258/269 (95%), Positives = 260/269 (95%), Gaps = 4/269 (1%)

Query: 1   MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 60
           MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI
Sbjct: 435 MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 494

Query: 61  VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 120
           VSN E+RKEYEMKRMAENELSRSVNEFLSKLQ    EAMNTMMCSRCQGKHRRFEMDREP
Sbjct: 495 VSNPERRKEYEMKRMAENELSRSVNEFLSKLQ----EAMNTMMCSRCQGKHRRFEMDREP 550

Query: 121 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 180
           KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH
Sbjct: 551 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 610

Query: 181 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA 240
           RVPYHISFGSR+PGT GRQRATPDAPPADLQDFLSRIFQVPPGQM NGNFFAAPQP PGA
Sbjct: 611 RVPYHISFGSRMPGTSGRQRATPDAPPADLQDFLSRIFQVPPGQMSNGNFFAAPQPGPGA 670

Query: 241 AAASKPNSTVPKGEAKPKRRKKVRRPFQR 269
            AASKPNSTVPKGEAKPKRRKKVRRPFQR
Sbjct: 671 TAASKPNSTVPKGEAKPKRRKKVRRPFQR 699


> g15777193 J-domain protein Jiv [Bos taurus]
Length = 699

Score = 535 bits (1364), Expect = e-151
Identities = 258/269 (95%), Positives = 260/269 (95%), Gaps = 4/269 (1%)

Query: 1   MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 60
           MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI
Sbjct: 435 MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 494

Query: 61  VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 120
           VSN E+RKEYEMKRMAENELSRSVNEFLSKLQ    EAMNTMMCSRCQGKHRRFEMDREP
Sbjct: 495 VSNPERRKEYEMKRMAENELSRSVNEFLSKLQ----EAMNTMMCSRCQGKHRRFEMDREP 550

Query: 121 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 180
           KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH
Sbjct: 551 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 610

Query: 181 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA 240
           RVPYHISFGSR+PGT GRQRATPDAPPADLQDFLSRIFQVPPGQM NGNFFAAPQP PGA
Sbjct: 611 RVPYHISFGSRMPGTSGRQRATPDAPPADLQDFLSRIFQVPPGQMSNGNFFAAPQPGPGA 670

Query: 241 AAASKPNSTVPKGEAKPKRRKKVRRPFQR 269
            AASKPNSTVPKGEAKPKRRKKVRRPFQR
Sbjct: 671 TAASKPNSTVPKGEAKPKRRKKVRRPFQR 699



> g26337373 unnamed protein product [Mus musculus] 
Length = 703

Score = 534 bits (1361), Expect = e-150
Identities = 251/269 (93%), Positives = 259/269 (95%)

Query: 1   MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 60
           MAGVPEDELNPFHVLGVEATASD ELKKAYRQLAVMVHPDKNHHPRAEEAFK+LRAAWDI
Sbjct: 435 MAGVPEDELNPFHVLGVEATASDTELKKAYRQLAVMVHPDKNHHPRAEEAFKILRAAWDI 494

Query: 61  VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 120
           VSN E+RKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP
Sbjct: 495 VSNPERRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 554

Query: 121 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 180
           KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH
Sbjct: 555 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 614

Query: 181 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA 240
           RVPYHISFGSR+PGT GRQRATP++PPADLQDFLSRIFQVPPG M NGNFFAAP P PG
Sbjct: 615 RVPYHISFGSRVPGTSGRQRATPESPPADLQDFLSRIFQVPPGPMSNGNFFAAPHPGPGT 674

Query: 241 AAASKPNSTVPKGEAKPKRRKKVRRPFQR 269
            + S+PNS+VPKGEAKPKRRKKVRRPFQR
Sbjct: 675 TSTSRPNSSVPKGEAKPKRRKKVRRPFQR 703


> g12857284 unnamed protein product [Mus musculus]
Length = 703

Score = 534 bits (1361), Expect = e-150
Identities = 251/269 (93%), Positives = 259/269 (95%)

Query: 1   MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 60
           MAGVPEDELNPFHVLGVEATASD ELKKAYRQLAVMVHPDKNHHPRAEEAFK+LRAAWDI
Sbjct: 435 MAGVPEDELNPFHVLGVEATASDTELKKAYRQLAVMVHPDKNHHPRAEEAFKILRAAWDI 494

Query: 61  VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 120
           VSN E+RKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP
Sbjct: 495 VSNPERRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 554

Query: 121 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 180
           KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH
Sbjct: 555 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 614

Query: 181 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA 240
           RVPYHISFGSR+PGT GRQRATP++PPADLQDFLSRIFQVPPG M NGNFFAAP P PG
Sbjct: 615 RVPYHISFGSRVPGTSGRQRATPESPPADLQDFLSRIFQVPPGPMSNGNFFAAPHPGPGT 674

Query: 241 AAASKPNSTVPKGEAKPKRRKKVRRPFQR 269
            + S+PNS+VPKGEAKPKRRKKVRRPFQR
Sbjct: 675 TSTSRPNSSVPKGEAKPKRRKKVRRPFQR 703

> g15843561 DnaJ1 protein [Bos taurus]
Length = 659

Score = 533 bits (1358), Expect = e-150
Identities = 257/269 (95%), Positives = 259/269 (95%), Gaps = 4/269 (1%)

Query: 1   MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 60
           MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI
Sbjct: 395 MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 454

Query: 61  VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 120
           VSN E+RKEYEMKRMAENELSRSVNEFLSKLQ    EAMNTMMCSRCQGKHR FEMDREP
Sbjct: 455 VSNPERRKEYEMKRMAENELSRSVNEFLSKLQ----EAMNTMMCSRCQGKHRSFEMDREP 510

Query: 121 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 180
           KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH
Sbjct: 511 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 570

Query: 181 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA 240
           RVPYHISFGSR+PGT GRQRATPDAPPADLQDFLSRIFQVPPGQM NGNFFAAPQP PGA
Sbjct: 571 RVPYHISFGSRMPGTSGRQRATPDAPPADLQDFLSRIFQVPPGQMSNGNFFAAPQPGPGA 630

Query: 241 AAASKPNSTVPKGEAKPKRRKKVRRPFQR 269
            AASKPNSTVPKGEAKPKRRKKVRRPFQR
Sbjct: 631 TAASKPNSTVPKGEAKPKRRKKVRRPFQR 659

> g15029846 RIKEN cDNA 5730551F12 gene [Mus musculus]
Length = 703

Score = 533 bits (1357), Expect = e-150
Identities = 250/269 (92%), Positives = 258/269 (94%)

Query: 1   MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 60
           MAGVPEDELNPFHVLGVEATASD ELKKAYRQLAVMVHPDKNHHPRAEEAFK+LRAAWDI
Sbjct: 435 MAGVPEDELNPFHVLGVEATASDTELKKAYRQLAVMVHPDKNHHPRAEEAFKILRAAWDI 494

Query: 61  VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 120
           VSN E+RKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP
Sbjct: 495 VSNPERRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 554

Query: 121 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 180
           KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH
Sbjct: 555 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 614

Query: 181 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA 240
           RVPYHISFGSR+PGT GRQRATP++PP DLQDFLSRIFQVPPG M NGNFFAAP P PG
Sbjct: 615 RVPYHISFGSRVPGTSGRQRATPESPPVDLQDFLSRIFQVPPGPMSNGNFFAAPHPGPGT 674

Query: 241 AAASKPNSTVPKGEAKPKRRKKVRRPFQR 269
            + S+PNS+VPKGEAKPKRRKKVRRPFQR
Sbjct: 675 TSTSRPNSSVPKGEAKPKRRKKVRRPFQR 703



> g26349793 unnamed protein product [Mus musculus] 
Length = 703

Score = 531 bits (1354), Expect = e-150
Identities = 250/269 (92%), Positives = 258/269 (94%)

Query: 1   MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 60
           MAGVPEDELNPFHVLGVEATASD ELKKAYRQLAVMVHPDKNHHPRAEEAFK+LRAAWDI
Sbjct: 435 MAGVPEDELNPFHVLGVEATASDTELKKAYRQLAVMVHPDKNHHPRAEEAFKILRAAWDI 494

Query: 61  VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 120
           VSN E+RKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP
Sbjct: 495 VSNPERRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 554

Query: 121 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 180
           KSA YCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH
Sbjct: 555 KSAGYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 614

Query: 181 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAPGA 240
           RVPYHISFGSR+PGT GRQRATP++PPADLQDFLSRIFQVPPG M NGNFFAAP P PG
Sbjct: 615 RVPYHISFGSRVPGTSGRQRATPESPPADLQDFLSRIFQVPPGPMSNGNFFAAPHPGPGT 674

Query: 241 AAASKPNSTVPKGEAKPKRRKKVRRPFQR 269
            + S+PNS+VPKGEAKPKRRKKVRRPFQR
Sbjct: 675 TSTSRPNSSVPKGEAKPKRRKKVRRPFQR 703



> g26337271 unnamed protein product [Mus musculus] 
Length = 678

Score = 454 bits (1156), Expect = e-126
Identities = 216/238 (90%), Positives = 221/238 (92%)

Query: 1   MAGVPEDELNPFHVLGVEATASDVELKKAYRQLAVMVHPDKNHHPRAEEAFKVLRAAWDI 60
           MAGVPEDELNPFHVLGVEATASD ELKKAYRQLAVMVHPDKNHHPRAEEAFK+LRAAWDI
Sbjct: 435 MAGVPEDELNPFHVLGVEATASDTELKKAYRQLAVMVHPDKNHHPRAEEAFKILRAAWDI 494

Query: 61  VSNAEKRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 120
           VSN E+RKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP
Sbjct: 495 VSNPERRKEYEMKRMAENELSRSVNEFLSKLQDDLKEAMNTMMCSRCQGKHRRFEMDREP 554

Query: 121 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 180
           KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH
Sbjct: 555 KSARYCAECNRLHPAEEGDFWAESSMLGLKITYFALMDGKVYDITEWAGCQRVGISPDTH 614

Query: 181 RVPYHISFGSRIPGTRGRQRATPDAPPADLQDFLSRIFQVPPGQMPNGNFFAAPQPAP 238
           RVPYHISFGSR+PGT GRQRATP++PPADLQDFLSRIFQVP G            P P
Sbjct: 615 RVPYHISFGSRVPGTSGRQRATPESPPADLQDFLSRIFQVPSGADVQWELLCRTSPWP 672



Database: genpept137
Posted date: Sep 11, 2003 11:22 AM
Number of letters in database: 474,463,515
Number of sequences in database: 1,534,369

Lambda     K       H
0.318      0.133   0.404

Gapped
Lambda     K       H
0.270      0.0470  0.230

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 262275488
Number of Sequences: 1534369
Number of extensions: 10885493
Number of successful extensions: 48330
Number of sequences better than 10.0: 1367
Number of HSP's better than 10.0 without gapping: 1113
Number of HSP's successfully gapped in prelim test: 254
Number of HSP's that attempted gapping in prelim test: 46724
Number of HSP's gapped (non-prelim): 1477
length of query: 269
length of database: 474,463,515
effective HSP length: 57
effective length of query: 212
effective length of database: 387,004,482
effective search space: 82044950184
effective search space used: 82044950184
T: 11
A: 40
X1: 16
X2: 38
X3: 64
S1: 41
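
The effective lengths in the summary above follow from the raw lengths and the effective HSP length, and the Expect values in the alignment headers can be approximated from the bit scores. The sketch below reproduces that arithmetic. It assumes the usual Karlin-Altschul relation E = (search space) x 2^(-bit score), which the report itself does not print; the function and variable names are illustrative only.

```python
import math

# Values copied from the search summary above.
QUERY_LEN = 269
DB_LETTERS = 474_463_515
DB_SEQS = 1_534_369
HSP_LEN = 57  # "effective HSP length"

# Each sequence (query or database entry) is shortened by the HSP length.
eff_query = QUERY_LEN - HSP_LEN
eff_db = DB_LETTERS - DB_SEQS * HSP_LEN
search_space = eff_query * eff_db


def expect_from_bits(bits, space):
    """Assumed relation: E = space * 2**(-bits) for a normalized (bit) score."""
    return space * 2.0 ** (-bits)


if __name__ == "__main__":
    print(eff_query, eff_db, search_space)
    for bits in (562, 536, 454):
        e_value = expect_from_bits(bits, search_space)
        # Print only the order of magnitude, as the report does (e.g. e-159).
        print(bits, "e%d" % math.floor(math.log10(e_value)))
```

The three integers printed first match the "effective length of query", "effective length of database", and "effective search space" lines of the report.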
Exploring the Metabolic and Genetic Control of
Gene Expression on a Genomic Scale

Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*

DNA microarrays containing virtually every gene of Saccharomyces cerevisiae were used 
to carry out a comprehensive investigation of the temporal program of gene expression 
accompanying the metabolic shift from fermentation to respiration. The expression 
profiles observed for genes with known metabolic functions pointed to features of the 
metabolic reprogramming that occur during the diauxic shift, and the expression patterns 
of many previously uncharacterized genes provided clues to their possible functions. The 
same DNA microarrays were also used to identify genes whose expression was affected 
by deletion of the transcriptional co-repressor TUP1 or overexpression of the
transcriptional activator YAP1. These results demonstrate the feasibility and utility of
this approach to genomewide exploration of gene expression patterns.



The complete sequences of nearly a dozen microbial genomes are known,
and in the next several years we expect to know the complete genome
sequences of several metazoans, including the human genome. Defining
the role of each gene in these genomes will be a formidable task, and
understanding how the genome functions as a whole in the complex
natural history of a living organism presents an even greater
challenge.

Knowing when and where a gene is expressed often provides a strong
clue as to its biological role. Conversely, the pattern of genes
expressed in a cell can provide detailed information about its state.
Although regulation of protein abundance in a cell is by no means
accomplished solely by regulation of mRNA, virtually all differences
in cell type or state are correlated with changes in the mRNA levels
of many genes. This is fortuitous because the only specific reagent
required to measure the abundance of the mRNA for a specific gene is a
cDNA sequence. DNA microarrays, consisting of thousands of individual
gene sequences printed in a high-density array on a glass microscope
slide (1, 2), provide a practical and economical tool for studying
gene expression on a very large scale (3-6).

Saccharomyces cerevisiae is an especially favorable organism in which
to conduct a systematic investigation of gene expression. The genes
are easy to recognize in the genome sequence, cis regulatory elements
are generally compact and close to the transcription units, much is
already known about its genetic regulatory mechanisms, and a powerful
set of tools is available for its analysis.

A recurring cycle in the natural history of yeast involves a shift
from anaerobic (fermentation) to aerobic (respiration) metabolism.
Inoculation of yeast into a medium rich in sugar is followed by rapid
growth fueled by fermentation, with the production of ethanol. When
the fermentable sugar is exhausted, the yeast cells turn to ethanol as
a carbon source for aerobic growth. This switch from anaerobic growth
to aerobic respiration upon depletion of glucose, referred to as the
diauxic shift, is correlated with widespread changes in the expression
of genes involved in fundamental cellular processes such as carbon
metabolism, protein synthesis, and carbohydrate storage (7). We used
DNA microarrays to characterize the changes in gene expression that
take place during this process for nearly the entire genome, and to
investigate the genetic circuitry that regulates and executes this
program.

Yeast open reading frames (ORFs) were amplified by the polymerase
chain reaction (PCR), with a commercially available set of primer
pairs (8). DNA microarrays, containing approximately 6400 distinct DNA
sequences, were printed onto glass slides by using a simple robotic
printing device (9). Cells from an exponentially growing culture of
yeast were inoculated into fresh medium and grown at 30°C for 21
hours. After an initial 9 hours of growth, samples were harvested at
seven successive 2-hour intervals, and mRNA was isolated (10).
Fluorescently labeled cDNA was prepared by reverse transcription in
the presence of Cy3(green)- or Cy5(red)-labeled deoxyuridine
triphosphate (dUTP) (11) and then hybridized to the microarrays (12).
To maximize the reliability with which changes in expression levels
could be discerned, we labeled cDNA prepared from cells at each
successive time point with Cy5, then mixed it with a Cy3-labeled
"reference" cDNA sample prepared from cells harvested at the first
interval after inoculation. In this experimental design, the relative
fluorescence intensity measured for the Cy3 and Cy5 fluors at each
array element provides a reliable measure of the relative abundance of
the corresponding mRNA in the two cell populations (Fig. 1). Data from
the series of seven samples (Fig. 2), consisting of more than 43,000
expression-ratio measurements, were organized into a database to
facilitate efficient exploration and analysis of the results. This
database is publicly available on the Internet (13).
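
As a rough illustration of the two-color design described above, the following sketch turns a pair of fluorescence intensities for one array element into an expression ratio (time point over reference) and its base-2 logarithm. It is a minimal sketch, not the authors' processing pipeline: background subtraction and normalization are omitted, the handling of non-positive intensities is an assumption, and the intensity values are invented for the example.

```python
import math


def expression_ratio(cy5, cy3):
    """Ratio of time-point (Cy5) to reference (Cy3) signal for one array element."""
    if cy5 <= 0 or cy3 <= 0:
        # Assumption: non-positive intensities are treated as missing data.
        return None
    return cy5 / cy3


def log2_ratio(cy5, cy3):
    """Base-2 log of the ratio, so induction and repression are symmetric about 0."""
    ratio = expression_ratio(cy5, cy3)
    return None if ratio is None else math.log2(ratio)


# Hypothetical intensities for three array elements (not data from the paper).
spots = {
    "geneA": (1200.0, 300.0),  # fourfold higher at the time point
    "geneB": (150.0, 600.0),   # fourfold lower at the time point
    "geneC": (500.0, 500.0),   # unchanged
}

if __name__ == "__main__":
    for name, (cy5, cy3) in spots.items():
        print(name, expression_ratio(cy5, cy3), log2_ratio(cy5, cy3))
```

Seven samples on an array of approximately 6400 elements give at most about 44,800 ratios, which is consistent with the "more than 43,000" measurements reported.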

During exponential growth in glucose-rich medium, the global pattern
of gene expression was remarkably stable. Indeed, when gene expression
patterns between the first two cell samples (harvested at a 2-hour
interval) were compared, mRNA levels differed by a factor of 2 or more
for only 19 genes (0.3%), and the largest of these differences was
only 2.7-fold (14). However, as glucose was progressively depleted
from the growth media during the course of the experiment, a marked
change was seen in the global pattern of gene expression. mRNA levels
for approximately 710 genes were induced by a factor of at least 2,
and the mRNA levels for approximately 1030 genes declined by a factor
of at least 2. Messenger RNA levels for 183 genes increased by a
factor of at least 4, and mRNA levels for 203 genes diminished by a
factor of at least 4. About half of these differentially expressed
genes have no currently recognized function and are not yet named.
Indeed, more than 400 of the differentially expressed genes have no
apparent homology to any gene whose function is known (15). The
responses of these previously uncharacterized genes to the diauxic
shift therefore provide the first small clue to their possible roles.
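
The counts quoted above come from applying fold-change thresholds to the expression ratios. A minimal sketch of such a tally is shown below; the function name, the use of inclusive thresholds, and the example ratios are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def classify(ratio, fold=2.0):
    """Label a ratio as induced, repressed, or unchanged at a given fold threshold."""
    if ratio >= fold:
        return "induced"
    if ratio <= 1.0 / fold:
        return "repressed"
    return "unchanged"


# Hypothetical expression ratios (time point / reference), not data from the paper.
ratios = {"g1": 4.5, "g2": 2.0, "g3": 1.3, "g4": 0.5, "g5": 0.2, "g6": 0.9}

at_2fold = Counter(classify(r, fold=2.0) for r in ratios.values())
at_4fold = Counter(classify(r, fold=4.0) for r in ratios.values())

if __name__ == "__main__":
    print(dict(at_2fold))
    print(dict(at_4fold))
    # Share of the array changing between the first two samples, per the text.
    print(round(100 * 19 / 6400, 1))
```

Raising the threshold from twofold to fourfold shrinks both the induced and the repressed sets, which is the pattern the reported counts (710 and 1030 genes versus 183 and 203 genes) show.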

The global view of changes in expression of genes with known functions
provides a vivid picture of the way in which the cell adapts to a
changing environment. Figure 3 shows a portion of the yeast metabolic
pathways involved in carbon and energy metabolism. Mapping the changes
we observed in the mRNAs encoding each enzyme onto this framework
allowed us to infer the redirection in the flow of metabolites through
this system. We observed large inductions of the genes coding for the
enzymes aldehyde dehydrogenase (ALD2) and acetyl-coenzyme A (CoA)
synthase (ACS1), which function together to convert the products of
alcohol dehydrogenase into acetyl-CoA, which in turn is used to fuel
the tricarboxylic acid (TCA) cycle and the glyoxylate cycle. The
concomitant shutdown of transcription of the genes encoding pyruvate
decarboxylase and induction of pyruvate carboxylase rechannels
pyruvate away from acetaldehyde, and instead to oxalacetate, where it
can serve to supply the TCA cycle and gluconeogenesis. Induction of
the pivotal genes PCK1, encoding phosphoenolpyruvate carboxykinase,
and FBP1, encoding fructose 1,6-biphosphatase, switches the directions
of two key irreversible steps in glycolysis, reversing the flow of
metabolites along the reversible steps of the glycolytic pathway
toward the essential biosynthetic precursor, glucose-6-phosphate.
Induction of the genes coding for the trehalose synthase and glycogen
synthase complexes promotes channeling of glucose-6-phosphate into
these carbohydrate storage pathways.

Just as the changes in expression of genes encoding pivotal enzymes
can provide insight into metabolic reprogramming, the behavior of
large groups of functionally related genes can provide a broad view of
the systematic way in which the yeast cell adapts to a changing
environment (Fig. 4). Several classes of genes, such as cytochrome
c-related genes and those involved in the TCA/glyoxylate cycle and
carbohydrate storage, were coordinately induced by glucose exhaustion.
In contrast, genes devoted to protein synthesis, including ribosomal
proteins, tRNA synthetases, and translation, elongation, and
initiation factors, exhibited a coordinated decrease in expression.
More than 95% of ribosomal genes showed at least twofold decreases in
expression during the diauxic shift (Fig. 4) (13). A noteworthy and
illuminating exception was that the genes encoding mitochondrial
ribosomal genes were generally induced rather than repressed after
glucose limitation, highlighting the requirement for mitochondrial
biogenesis (13). As more is learned about the functions of every gene
in the yeast genome, the ability to gain insight into a cell's
response to a changing environment through its global gene expression
patterns will become increasingly powerful.

Several distinct temporal patterns of ex- 
pression could be recognized, and sets of 
genes could be grouped on the basis of the 
similarities in their expression patterns. The 
characterized members of each of these 
groups also shared important similarities in 
their functions. Moreover, in most cases, 
common regulatory mechanisms could be 
inferred for sets of genes with similar expres- 
sion profiles. For example, seven genes 
showed a late induction profile, with mRNA 
levels increasing by more than ninefold at
the last timepoint but less than threefold at
the preceding timepoint (Fig. 5B). All of 
these genes were known to be glucose-re- 
pressed, and five of the seven were previously 
noted to share a common upstream activat- 
ing sequence (UAS), the carbon source re- 
sponse element (CSRE) (16-20). A search 
in the promoter regions of the remaining two 
genes, ACR1 and IDP2, revealed that 
ACR1, a gene essential for ACS1 activity,
also possessed a consensus CSRE motif, but
interestingly, IDP2 did not. A search of the
entire yeast genome sequence for the con- 
sensus CSRE motif revealed only four addi- 
tional candidate genes, none of which 
showed a similar induction. 
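
The late-induction criterion described above (more than ninefold at the last timepoint, less than threefold at the one before) can be sketched as a simple filter. The Python below is an illustration only: the function name, the dictionary layout, and the toy ratios are assumptions made for this sketch, not data or code from the study.

```python
def late_induced(profiles, last_min=9.0, prior_max=3.0):
    """Return genes induced > last_min-fold at the final timepoint but
    < prior_max-fold at the preceding one.

    `profiles` maps a gene name to its fold-change ratios ordered by
    time (a hypothetical layout chosen for this sketch).
    """
    return sorted(
        gene
        for gene, ratios in profiles.items()
        if ratios[-1] > last_min and ratios[-2] < prior_max
    )


# Toy ratios, invented for illustration only.
toy_profiles = {
    "geneA": [1.0, 1.1, 1.3, 2.0, 2.5, 12.0],  # late induction
    "geneB": [1.0, 1.5, 3.5, 6.0, 8.0, 10.0],  # gradual induction
    "geneC": [1.0, 0.9, 0.8, 0.5, 0.3, 0.2],   # repression
}
print(late_induced(toy_profiles))
```

Only the first toy profile passes both thresholds; the gradually induced one exceeds ninefold at the end but fails the "less than threefold at the preceding timepoint" condition.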

Examples from additional groups of 
genes that shared expression profiles are 
illustrated in Fig. 5, C through F. The 
sequences upstream of the named genes in 
Fig. 5C all contain stress response ele- 
ments (STRE), and with the exception 




Fig. 1. Yeast genome microarray. The actual size of the microarray is 18 mm by 18 mm. The 
microarray was printed as described (9). This image was obtained with the same fluorescent 
scanning confocal microscope used to collect all the data we report (49). A fluorescently labeled 
cDNA probe was prepared from mRNA isolated from cells harvested shortly after inoculation (culture 
density of <5 × 10^6 cells/ml and media glucose level of 19 g/liter) by reverse transcription in the
presence of Cy3-dUTP. Similarly, a second probe was prepared from mRNA isolated from cells taken
from the same culture 9.5 hours later (culture density of ~2 × 10^8 cells/ml, with a glucose level of
<0.2 g/liter) by reverse transcription in the presence of Cy5-dUTP. In this image, hybridization of the
Cy3-dUTP-labeled cDNA (that is, mRNA expression at the initial timepoint) is represented as a green
signal, and hybridization of Cy5-dUTP-labeled cDNA (that is, mRNA expression at 9.5 hours) is 
represented as a red signal. Thus, genes induced or repressed after the diauxic shift appear in this 
image as red and green spots, respectively. Genes expressed at roughly equal levels before and after 
the diauxic shift appear in this image as yellow spots. 



www.sciencemag.org • SCIENCE • VOL 278 • 24 OCTOBER 1997 






of HSP42, have previously been shown to 
be controlled at least in part by these 
elements (21-24). Inspection of the se- 
quences upstream of HSP42 and the two 
uncharacterized genes shown in Fig. 5C, 
YKL026c, a hypothetical protein with 
similarity to glutathione peroxidase, and 
YGR043c, a putative transaldolase, revealed
that each of these genes also possesses
repeated upstream copies of the stress-responsive
CCCCT motif. Of the 13 additional
genes in the yeast genome that
shared this expression profile [including
HSP30, ALD2, OM45, and 10 uncharacterized
ORFs (25)], nine contained one or
more recognizable STRE sites in their up- 
stream regions. 

The heterotrimeric transcriptional acti- 
vator complex HAP2,3,4 has been shown 
to be responsible for induction of several 
genes important for respiration (26-28). 
This complex binds a degenerate consensus 
sequence known as the CCAAT box (26). 
Computer analysis, using the consensus se- 
quence TNRYTGGB (29), has suggested 
that a large number of genes involved in 
respiration may be specific targets of 
HAP2,3,4 (30). Indeed, a putative 
HAP2,3,4 binding site could be found in
the sequences upstream of each of the seven 
cytochrome c-related genes that showed 
the greatest magnitude of induction (Fig. 
5D). Of 12 additional cytochrome c-related 
genes that were induced, HAP2,3,4 binding 
sites were present in all but one. Signifi- 
cantly, we found that transcription of 
HAP4 itself was induced nearly ninefold 
concomitant with the diauxic shift. 
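
To make the degenerate consensus TNRYTGGB concrete, the sketch below expands it with the standard IUPAC nucleotide codes (R = A/G, Y = C/T, B = C/G/T, N = any base) and scans both strands of an upstream sequence. It is a minimal illustration, not the computer analysis cited above: the function names and the test sequence are hypothetical, and the input is assumed to contain only uppercase A, C, G, and T.

```python
import re

# Standard IUPAC codes needed for this consensus.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "B": "[CGT]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in consensus)


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sites(upstream, consensus="TNRYTGGB"):
    """Return sorted (forward-strand start, strand) hits on either strand."""
    k = len(consensus)
    # A lookahead lets overlapping matches be reported as well.
    scanner = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    hits = [(m.start(), "+") for m in scanner.finditer(upstream)]
    rc = revcomp(upstream)
    hits += [(len(upstream) - m.start() - k, "-") for m in scanner.finditer(rc)]
    return sorted(hits)


# Hypothetical 12-base fragment containing one forward-strand match.
print(find_sites("GGTTACTGGCAA"))
```

Because the consensus is written for one strand, a CCAAT box on the given strand appears as a reverse-strand hit, which is why the sketch scans both orientations.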

Control of ribosomal protein biogenesis 
is mainly exerted at the transcriptional 
level, through the presence of a common
upstream-activating element (UASrpg)
that is recognized by the Rap1 DNA-binding
protein (31, 32). The expression profiles
of seven ribosomal proteins are shown
in Fig. 5F. A search of the sequences
upstream of all seven genes revealed consensus
Rap1-binding motifs (33). It has
been suggested that declining Rap1 levels
in the cell during starvation may be responsible
for the decline in ribosomal protein
gene expression (34). Indeed, we observed
that the abundance of RAP1
mRNA diminished by 4.4-fold, at about
the time of glucose exhaustion.

Of the 149 genes that encode known or 
putative transcription factors, only two, 
HAP4 and SIP4, were induced by a factor of
more than threefold at the diauxic shift.
SIP4 encodes a DNA-binding transcriptional
activator that has been shown to
interact with Snf1, the "master regulator" of
glucose repression (35). The eightfold induction
of SIP4 upon depletion of glucose
strongly suggests a role in the induction of 



downstream genes at the diauxic shift. 

Although most of the transcriptional 
responses that we observed were not pre- 
viously known, the responses of many 
genes during the diauxic shift have been 
described. Comparison of the results we 
obtained by DNA microarray hybridiza- 
tion with previously reported results there- 
fore provided a strong test of the sensitiv- 
ity and accuracy of this approach. The 
expression patterns we observed for previ- 
ously characterized genes showed almost 
perfect concordance with previously pub- 
lished results (36). Moreover, the differ- 
ential expression measurements obtained 
by DNA microarray hybridization were re- 
producible in duplicate experiments. For 
example, the remarkable changes in gene 
expression between cells harvested imme- 
diately after inoculation and immediately 
after the diauxic shift (the first and sixth 
intervals in this time series) were mea- 
sured in duplicate, independent DNA mi- 
croarray hybridizations. The correlation 
coefficient for two complete sets of expres- 
sion ratio measurements was 0.87, and for 
more than 95% of the genes, the expression
ratios measured in these duplicate
experiments differed by less than a factor 
of 2. However, in a few cases, there were 
discrepancies between our results and pre- 
vious results, pointing to technical limita- 
tions that will need to be addressed as 
DNA microarray technology advances 
(37, 38). Despite the noted exceptions, 
the high concordance between the results 
we obtained in these experiments and 
those of previous studies provides confi- 
dence in the reliability and thoroughness 
of the survey. 
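
The two reproducibility figures quoted above (a correlation coefficient between duplicate ratio sets, and the fraction of genes whose duplicate ratios differ by less than a factor of 2) can be computed as in the following sketch. The text does not say whether the correlation was taken on raw or log-transformed ratios; working in log2 space here is an assumption, as are the function names and the toy inputs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def reproducibility(ratios_a, ratios_b):
    """Return (correlation of log2 ratios, fraction within twofold)."""
    logs_a = [math.log2(r) for r in ratios_a]
    logs_b = [math.log2(r) for r in ratios_b]
    # Ratios differ by less than a factor of 2 when their log2 values
    # differ by less than 1.
    within = sum(abs(a - b) < 1.0 for a, b in zip(logs_a, logs_b))
    return pearson(logs_a, logs_b), within / len(logs_a)


# Toy duplicate measurements, invented for illustration only.
r, frac = reproducibility([1.0, 2.0, 4.0, 8.0], [1.0, 2.0, 4.0, 8.0])
print(r, frac)
```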

The changes in gene expression during 
this diauxic shift are complex and involve 
integration of many kinds of information 
about the nutritional and metabolic state 
of the cell. The large number of genes 
whose expression is altered and the diver- 
sity of temporal expression profiles ob- 
served in this experiment highlight the 
challenge of understanding the underlying 
regulatory mechanisms. One approach to 
defining the contributions of individual 
regulatory genes to a complex program of 
this kind is to use DNA microarrays to 
identify genes whose expression is affected 



Fig. 2. The section of the ar- 
ray indicated by the gray box 
in Fig. 1 is shown for each of 
the experiments described 
here. Representative genes 
are labeled. In each of the ar- 
rays used to analyze gene 
expression during the diauxic 
shift, red spots represent 
genes that were induced rel- 
ative to the initial timepoint, 
and green spots represent 
genes that were repressed 
relative to the initial timepoint. 
In the arrays used to analyze
the effects of the tup1Δ mutation
and YAP1 overexpression,
red spots represent
genes whose expression was 
increased, and green spots 
represent genes whose ex- 
pression was decreased by 
the genetic modification. Note 
that distinct sets of genes are 
induced and repressed in the 
different experiments. The 
complete images of each of 
these arrays can be viewed on 
the Internet (13). Cell density
as measured by optical density
(OD) at 600 nm was used to
measure the growth of the
culture.

by mutations in each putative regulatory
gene. As a test of this strategy, we analyzed 
the genomewide changes in gene expression 
that result from deletion of the TUP1 gene.
Transcriptional repression of many genes by
glucose requires the DNA-binding repressor
Mig1 and is mediated by recruiting the transcriptional
co-repressors Tup1 and Cyc8/Ssn6 (39).
Tup1 has also been implicated in
repression of oxygen-regulated, mating-type-specific,
and DNA-damage-inducible genes
(40).







Fig. 3. Metabolic reprogramming inferred from global analysis of changes in gene expression. Only key 
metabolic intermediates are identified. The yeast genes encoding the enzymes that catalyze each step 
in this metabolic circuit are identified by name in the boxes. The genes encoding succinyl-CoA synthase
and glycogen-debranching enzyme have not been explicitly identified, but the ORFs YGR244 and 
YPR184 show significant homology to known succinyl-CoA synthase and glycogen-debranching en- 
zymes, respectively, and are therefore included in the corresponding steps in this figure. Red boxes with 
white lettering identify genes whose expression increases in the diauxic shift. Green boxes with dark 
green lettering identify genes whose expression diminishes in the diauxic shift. The magnitude of 
induction or repression is indicated for these genes. For multimeric enzyme complexes, such as 
succinate dehydrogenase, the indicated fold-induction represents an unweighted average of all the 
genes listed in the box. Black and white boxes indicate no significant differential expression (less than 
twofold). The direction of the arrows connecting reversible enzymatic steps indicates the direction of the
flow of metabolic intermediates, inferred from the gene expression pattern, after the diauxic shift. Arrows
representing steps catalyzed by genes whose expression was strongly induced are highlighted in red.
The broad gray arrows represent major increases in the flow of metabolites after the diauxic shift, 
inferred from the indicated changes in gene expression.

Wild-type yeast cells and cells bearing
a deletion of the TUP1 gene (tup1Δ) were
grown in parallel cultures in rich medium
containing glucose as the carbon source.
Messenger RNA was isolated from exponentially
growing cells from the two populations
and used to prepare cDNA labeled
with Cy3 (green) and Cy5 (red),
respectively (11). The labeled probes were
mixed and simultaneously hybridized to 
the microarray. Red spots on the microar- 
ray therefore represented genes whose 
transcription was induced in the tup1Δ
strain, and thus presumably repressed by
Tup1 (41). A representative section of the
microarray (Fig. 2, bottom middle panel) 
illustrates that the genes whose expression 
was affected by the tup1Δ mutation were,
in general, distinct from those induced 
upon glucose exhaustion [complete images 
of all the arrays shown in Fig. 2 are avail- 
able on the Internet (13)]. Nevertheless, 
34 (10%) of the genes that were induced 
by a factor of at least 2 after the diauxic 
shift were similarly induced by deletion of 
TUP1, suggesting that these genes may be
subject to TUP1-mediated repression by
glucose. For example, SUC2, the gene en- 
coding invertase, and all five hexose trans- 
porter genes that were induced during the 
course of the diauxic shift were similarly 
induced, in duplicate experiments, by the 
deletion of TUP1.

The set of genes affected by Tup1 in this
experiment also included α-glucosidases,
the mating-type-specific genes MFA1 and
MFA2, and the DNA damage-inducible
RNR2 and RNR4, as well as genes involved
in flocculation and many genes of unknown
function. The hybridization signal corresponding
to expression of TUP1 itself was
also severely reduced because of the (incomplete)
deletion of the transcription unit
in the tup1Δ strain, providing a positive
control in the experiment (42).

Many of the transcriptional targets of
Tup1 fell into sets of genes with related
biochemical functions. For instance, although
only about 3% of all yeast genes
appeared to be TUP1-repressed by a factor
of more than 2 in duplicate experiments
under these conditions, 6 of the 13 genes
that have been implicated in flocculation
(15) showed a reproducible increase in
expression of at least twofold when TUP1
was deleted. Another group of related
genes that appeared to be subject to TUP1
repression encodes the serine-rich cell
wall mannoproteins, such as Tip1 and
Tir1/Srp1, which are induced by cold
shock and other stresses (43), and similar,
serine-poor proteins, the seripauperins
(44). Messenger RNA levels for 23 of the
26 genes in this group were reproducibly
elevated by at least 2.5-fold in the tup1Δ
strain, and 18 of these genes were induced
by more than sevenfold when TUP1 was
deleted. In contrast, none of 83 genes that
could be classified as putative regulators of
the cell division cycle were induced more
than twofold by deletion of TUP1. Thus,
despite the diversity of the regulatory systems
that employ Tup1, most of the genes
that it regulates under these conditions
fall into a limited number of distinct functional
classes.
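
One way to gauge how unusual it is for 6 of 13 flocculation genes to respond when only about 3% of all genes do is a hypergeometric tail probability. This calculation is not part of the original analysis; it is a hedged sketch in which the genome size (6000) and the corresponding count of TUP1-repressed genes (180, that is, 3%) are round-number assumptions.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` genes are sampled without
    replacement from `population` genes, `successes` of which respond."""
    total = comb(population, draws)
    upper = min(draws, successes)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favorable / total


# Assumed round numbers: 6000 genes, 3% (180) TUP1-repressed;
# 13 flocculation genes, 6 of which responded.
p_value = hypergeom_tail(6000, 180, 13, 6)
print(p_value)
```

A very small tail probability under these assumptions would be consistent with the paper's observation that Tup1 targets cluster in a few functional classes; the same function could be applied to the 23 of 26 mannoprotein and seripauperin genes.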

Because the microarray allows us to 
monitor expression of nearly every gene in 
yeast, we can, in principle, use this ap- 
proach to identify all the transcriptional 
targets of a regulatory protein like Tup1. It
is important to note, however, that in any
single experiment of this kind we can only
recognize those target genes that are normally
repressed (or induced) under the
conditions of the experiment. For instance,
the experiment described here analyzed
a MATα strain in which MFA1
and MFA2, the genes encoding the a-factor
mating pheromone precursor, are
normally repressed. In the isogenic tup1Δ
strain, these genes were inappropriately
expressed, reflecting the role that Tup1
plays in their repression. Had we instead
carried out this experiment with a MATa
strain (in which expression of MFA1 and
MFA2 is not repressed), it would not have
been possible to conclude anything regarding
the role of Tup1 in the repression
of these genes. Conversely, we cannot distinguish
indirect effects of the chronic
absence of Tup1 in the mutant strain from
effects directly attributable to its partici- 
pation in repressing the transcription of a 
gene. 

Another simple route to modulating the
activity of a regulatory factor is to overexpress
the gene that encodes it. YAP1 encodes
a DNA-binding transcription factor
belonging to the b-zip class of DNA-binding
proteins. Overexpression of YAP1 in
yeast confers increased resistance to hydrogen
peroxide, o-phenanthroline, heavy
metals, and osmotic stress (45). We analyzed
differential gene expression between a
wild-type strain bearing a control plasmid
and a strain with a plasmid expressing YAP1
under the control of the strong GAL1-10
promoter, both grown in galactose (that is,
a condition that induces YAP1 overexpression).
Complementary DNA from the control
and YAP1-overexpressing strains, labeled
with Cy3 and Cy5, respectively, was
prepared from mRNA isolated from the two
strains and hybridized to the microarray.
Thus, red spots on the array represent genes
that were induced in the strain overexpressing
YAP1.

Of the 17 genes whose mRNA levels
increased by more than threefold when
YAP1 was overexpressed in this way, five
bear homology to aryl-alcohol oxidoreduc- 
tases (Fig. 2 and Table 1). An additional 
four of the genes in this set also belong to 
the general class of dehydrogenases/oxi- 
doreductases. Very little is known about 
the role of aryl-alcohol oxidoreductases in 
S. cerevisiae, but these enzymes have been 
isolated from ligninolytic fungi, in which 
they participate in coupled redox reac- 
tions, oxidizing aromatic, and aliphatic 
unsaturated alcohols to aldehydes with the 
production of hydrogen peroxide (46, 47). 
The fact that a remarkable fraction of the 
targets identified in this experiment be- 
long to the same small, functional group of 
oxidoreductases suggests that these genes 

Fig. 4. Coordinated regulation of functionally related
genes. The curves represent the average induction
or repression ratios for all the genes in
each indicated group. The total number of
genes in each group was as follows: ribosomal
proteins, 112; translation elongation and initiation
factors, 25; tRNA synthetases (excluding mitochondrial synthetases), 17; glycogen and trehalose synthesis
and degradation, 15; cytochrome c oxidase and reductase proteins, 19; and TCA- and glyoxylate-cycle
enzymes, 24.

Table 1. Genes induced by YAP1 overexpression. This list includes all the genes for which mRNA levels
increased by more than twofold upon YAP1 overexpression in both of two duplicate experiments, and 
for which the average increase in mRNA level in the two experiments was greater than threefold (50). 
Positions of the canonical Yap1 binding sites upstream of the start codon, when present, and the 
average fold-increase in mRNA levels measured in the two experiments are indicated. 



might play an important protective role 
during oxidative stress. Transcription of a 
small number of genes was reduced in the 
strain overexpressing Yap1. Interestingly,
many of these genes encode sugar per- 
meases or enzymes involved in inositol 
metabolism. 

We searched for Yap1-binding sites
(TTACTAA or TGACTAA) in the sequences
upstream of the target genes we
identified (48). About two-thirds of the
genes that were induced by more than
threefold upon Yap1 overexpression had
one or more binding sites within 600 bases
upstream of the start codon (Table 1), suggesting
that they are directly regulated by
Yap1. The absence of canonical Yap1-binding
sites upstream of the others may reflect
an ability of Yap1 to bind sites that differ
from the canonical binding sites, perhaps in
cooperation with other factors, or, less likely,
may represent an indirect effect of Yap1
overexpression, mediated by one or more
intermediary factors. Yap1 sites were found
only four times in the corresponding region
of an arbitrary set of 30 genes that were not
differentially regulated by Yap1.
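
The site search described above can be sketched as follows. The two motifs and the 600-base window come from the text; everything else (the function name, the convention that the input string ends immediately before the start codon, and the restriction to the given strand) is an assumption made for illustration and is not the procedure of note 48.

```python
import re

# TTACTAA or TGACTAA; the lookahead also reports overlapping sites.
YAP1_SITE = re.compile(r"(?=(T[TG]ACTAA))")


def yap1_site_offsets(upstream, window=600):
    """Return site positions relative to the start codon (negative numbers).

    `upstream` is assumed to be the 5'-to-3' sequence ending immediately
    before the start codon; only the last `window` bases are searched,
    and only on the strand given.
    """
    region = upstream[-window:]
    return [m.start() - len(region) for m in YAP1_SITE.finditer(region)]


# Hypothetical sequences, invented for illustration only.
print(yap1_site_offsets("A" * 10 + "TTACTAA" + "C" * 5))   # site near the start codon
print(yap1_site_offsets("TGACTAA" + "C" * 600))            # site outside the window
```

Reporting offsets relative to the start codon mirrors how Table 1 lists site positions; extending the sketch to the opposite strand would require scanning the reverse complement as well.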

Use of a DNA microarray to character- 
ize the transcriptional consequences of 
mutations affecting the activity of regulatory
molecules provides a simple and powerful
approach to dissection and characterization
of regulatory pathways and networks.
This strategy also has an important
practical application in drug screening. 
Mutations in specific genes encoding can- 
didate drug targets can serve as surrogates 
for the ideal chemical inhibitor or modu- 
lator of their activity. DNA microarrays 
can be used to define the resulting signa- 
ture pattern of alterations in gene expres- 
sion, and then subsequently used in an 
assay to screen for compounds that repro- 
duce the desired signature pattern. 

DNA microarrays provide a simple and 
economical way to explore gene expres- 
sion patterns on a genomic scale. The 
hurdles to extending this approach to any 
other organism are minor. The equipment 



required for fabricating and using DNA 
microarrays (9) consists of components 
that were chosen for their modest cost and 
simplicity. It was feasible for a small group 
to accomplish the amplification of more 
than 6000 genes in about 4 months and, 
once the amplified gene sequences were in 
hand, only 2 days were required to print a 
set of 110 microarrays of 6400 elements 
each. Probe preparation, hybridization, 
and fluorescent imaging are also simple 
procedures. Even conceptually simple ex- 
periments, as we described here, can yield 
vast amounts of information. The value of 
the information from each experiment of 
this kind will progressively increase as 
more is learned about the functions of 
each gene and as additional experiments 
define the global changes in gene expres- 
sion in diverse other natural processes and 
genetic perturbations. Perhaps the greatest 
challenge now is to develop efficient 
methods for organizing, distributing, inter- 
preting, and extracting insights from the 
large volumes of data these experiments 
will provide. 

REFERENCES AND NOTES 



1. M. Schena, D. Shalon, R. W. Davis, P. O. Brown,
Science 270, 467 (1995).

2. D. Shalon, S. J. Smith, P. O. Brown, Genome Res. 6,
639 (1996).

3. D. Lashkari, Proc. Natl. Acad. Sci. U.S.A., in press.

4. J. DeRisi et al., Nature Genet. 14, 457 (1996).

5. D. J. Lockhart et al., Nature Biotechnol. 14, 1675
(1996).

6. M. Chee et al., Science 274, 610 (1996).

7. M. Johnston and M. Carlson, in The Molecular Biology
of the Yeast Saccharomyces: Gene Expression,
E. W. Jones, J. R. Pringle, J. R. Broach, Eds. (Cold
Spring Harbor Laboratory Press, Cold Spring Harbor,
NY, 1992), p. 193.

8. Primers for each known or predicted protein coding
sequence were supplied by Research Genetics.
PCR was performed with the protocol supplied by
Research Genetics, using genomic DNA from yeast
strain S288C as a template. Each PCR product was
verified by agarose gel electrophoresis and was
deemed correct if the lane contained a single band of
appropriate mobility. Failures were marked as such
in the database. The overall success rate for a single-pass
amplification of 6116 ORFs was ~94.5%.

9. Glass slides (Gold Seal) were cleaned for 2 hours in a
solution of 2 N NaOH and 70% ethanol. After rinsing
in distilled water, the slides were then treated with a
1:5 dilution of poly-L-lysine adhesive solution (Sigma)
for 1 hour, and then dried for 5 min at 40°C in a
vacuum oven. DNA samples from 100-µl PCR reactions
were purified by ethanol purification in 96-well
microtiter plates. The resulting precipitates were resuspended
in 3x standard saline citrate (SSC) and
transferred to new plates for arraying. A custom-built
arraying robot was used to print on a batch of 110
slides. Details of the design of the microarrayer are
available at cmgm.stanford.edu/pbrown. After printing,
the microarrays were rehydrated for 30 s in a
humid chamber and then snap-dried for 2 s on a hot
plate (100°C). The DNA was then ultraviolet (UV)-cross-linked
to the surface by subjecting the slides to
60 mJ of energy (Stratagene Stratalinker). The rest of
the poly-L-lysine surface was blocked by a 15-min
incubation in a solution of 70 mM succinic anhydride
dissolved in a solution consisting of 315 ml of
1-methyl-2-pyrrolidinone (Aldrich) and 35 ml of 1 M
boric acid (pH 8.0). Directly after the blocking reac- 




Fig. 5. Distinct temporal patterns of induction or repression help to group genes that share regulatory 
properties. (A) Temporal profile of the cell density, as measured by OD at 600 nm and glucose 
concentration in the media. (B) Seven genes exhibited a strong induction (greater than ninefold) only at
the last timepoint (20.5 hours). With the exception of IDP2, each of these genes has a CSRE UAS. There
were no additional genes observed to match this profile. (C) Seven members of a class of genes marked
by early induction with a peak in mRNA levels at 18.5 hours. Each of these genes contains STRE motif
repeats in its upstream promoter region. (D) Cytochrome c oxidase and ubiquinol cytochrome c
reductase genes. Marked by an induction coincident with the diauxic shift, each of these genes contains
a consensus binding motif for the HAP2,3,4 protein complex. At least 17 genes shared a similar
expression profile. (E) SAM1, GPP1, and several genes of unknown function are repressed before the
diauxic shift, and continue to be repressed upon entry into stationary phase. (F) Ribosomal protein
genes comprise a large class of genes that are repressed upon depletion of glucose. Each of the genes

We describe here a method for drug target validation and identification of secondary drug target
effects based on genome-wide gene expression patterns. The method is demonstrated by
several experiments, including treatment of yeast mutant strains defective in calcineurin, im- 
munophilins or other genes with the immunosuppressants cyclosporin A or FK506. Presence or 
absence of the characteristic drug 'signature' pattern of altered gene expression in drug-treated 
cells with a mutation in the gene encoding a putative target established whether that target was 
required to generate the drug signature. Drug-dependent effects were seen in 'targetless' cells,
showing that FK506 affects additional pathways independent of calcineurin and the im- 
munophilins. The described method permits the direct confirmation of drug targets and recog- 
nition of drug-dependent changes in gene expression that are modulated through pathways 
distinct from the drug's intended target. Such a method may prove useful in improving the effi- 
ciency of drug development programs. 



Good drugs are potent and specific; that is, they must have 
strong effects on a specific biological pathway and minimal ef- 
fects on all other pathways. Confirmation that a compound in- 
hibits the intended target (drug target validation) and the 
identification of undesirable secondary effects are among the 
main challenges in developing new drugs. Comprehensive 
methods that enable researchers to determine which genes or 
activities are affected by a given drug might improve the effi- 
ciency of the drug discovery process by quickly identifying po- 
tential protein targets, or by accelerating the identification of 
compounds likely to be toxic. DNA microarray technology, 
which permits simultaneous measurement of the expression 
levels of thousands of genes, provides a comprehensive frame- 
work to determine how a compound affects cellular metabolism 
and regulation on a genomic scale 1 "". DNA microarrays that 
contain essentially every open reading frame (ORF) in the 
Saccharomyces cerevisiae genome have already been used success- 
fully to explore the changes in gene expression that accompany 
large changes in cellular metabolism or cell cycle progression 7 * 10 . 

In the modern drug discovery paradigm, which typically begins
with the selection of a single molecular target, the ideal inhibitory
drug is one that inhibits a single gene product so
completely and so specifically that it is as if the gene product 
were absent. Treating cells with such a drug should induce 
changes in gene expression very similar to those resulting from 
deleting the gene encoding the drug's target. Here we have com- 
pared the genome-wide effects on gene expression that result 
from deletions of various genes in the budding yeast S. cerevisiae
to the effects on gene expression that result from treatment
with known inhibitors of those gene products. Using the calcineurin
signaling pathway as a model system, we tested an approach
that permits identification of genes that encode proteins
specifically involved in pathways affected by a drug. The FK506 
characteristic pattern, or 'signature', of altered gene expression 
was not observed in mutant cells lacking proteins inhibited by 
FK506 (for example, a calcineurin or FK506-binding-protein 
mutant strain), but was observed in mutants deleted for genes 
in pathways unrelated to FK506 action (for example, a cy- 
clophilin mutant strain). Conversely, the cyclosporin A (CsA) 
signature was not observed in CsA-treated calcineurin or cyclophilin
mutant strains, but was seen in an FK506-binding-protein
mutant strain treated with CsA. The method also
demonstrates that FK506, a clinically used immunosuppressant,
has 'off-target' effects that are independent of its binding to immunophilins.
Thus, the approach we describe may provide a
way to identify the pathways altered by a drug and to detect 
drug effects mediated through unintended targets. 

Null mutants phenocopy drug-treated cells on a genomic scale 
To test whether a null mutation in a drug target serves as a 
model of an ideal inhibitory drug, we examined the effects on
gene expression associated with pharmacological or genetic in- 
hibition of calcineurin function. Calcineurin is a highly con- 
served calcium- and calmodulin-activated serine/threonine 
protein phosphatase implicated in diverse processes dependent 
on calcium signaling 12,13. In budding yeast, calcineurin is required
for intracellular ion homeostasis 14, for adaptation to prolonged
mating pheromone treatment 15 and in the regulation of

Fig. 1 Model of antagonism of the calcineurin signaling pathway mediated
by FK506 and cyclosporin A (CsA). Calcineurin activity is composed of a catalytic
subunit (calcineurin A, encoded in yeast by the CNA1 and CNA2 genes),
and calcium-binding regulatory subunits calmodulin (CMD) and calcineurin B
(CnB). After entering cells, FK506 and CsA specifically bind and inhibit the
peptidyl-proline isomerase activity of their respective immunophilins, FK506
binding proteins (FKBP) and cyclophilins (CyP). The most abundant immunophilins
in yeast (Fpr1 and Cph1) are thought to mediate calcineurin inhibition.
Drug-immunophilin complexes bind and inhibit the calcium- and
calmodulin-stimulated phosphatase calcineurin. Among the substrates of calcineurin
are transcriptional activators that act to modulate gene expression.



the onset of mitosis 16. In mammals, calcineurin has been implicated
in T-cell activation 12, in apoptosis 17, in cardiac hypertrophy 18
and in the transition from short-term to long-term
memory 19. In both organisms, calcineurin activity is inhibited
by FK506 and CsA, immunosuppressant drugs whose effects on
calcineurin are mediated through families of intracellular receptor
proteins called immunophilins 12,20 (Fig. 1). To assess the effects
of pharmacologic inhibition of calcineurin, wild-type S.
cerevisiae was grown to early logarithmic phase in the presence
or absence of FK506 or CsA. Isogenic cells, from which the
genes encoding the catalytic subunits of calcineurin (CNA1 and
CNA2) had been deleted 21 (referred to as the cna or calcineurin
mutant), were grown in parallel, in the absence of the drug.
Fluorescently labeled cDNA was prepared by reverse transcription
of polyA+ RNA in the presence of Cy3- or Cy5-deoxynucleotide
triphosphates and then hybridized to a microarray
containing more than 6,000 DNA probes representing 97% of
the known or predicted ORFs in the yeast genome.
Simultaneous hybridization of Cy5-labeled cDNA from mock-treated
cells and Cy3-labeled cDNA from cells treated with 1
µg/ml FK506 allowed the effect of drug treatment on mRNA levels
of each ORF to be determined (Fig. 2a and b and data not
shown). Similarly, effects of the calcineurin mutations on the 
mRNA levels of each gene were assessed by simultaneous hy- 
bridization of Cy5-labeled cDNA from wild-type cells and Cy3- 
labeled cDNA from the calcineurin mutant strain (Fig. 2c). For 
each comparison of this kind, reported expression ratios are the 
average of at least two hybridizations in which the Cy3 and Cy5 
fluors were reversed to remove biases that may be introduced by 
gene-specific differences in incorporation of the two fluors 
(data not shown). 
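
The fluor-reversal averaging described above can be illustrated with a short sketch. The text states only that ratios from hybridizations with reversed fluors were averaged; averaging in log space, the function name, and the toy numbers below are assumptions made for illustration.

```python
import math


def dye_swap_average(forward_ratio, swapped_ratio):
    """Combine a fluor-reversed pair of hybridizations for one gene.

    forward_ratio: raw Cy3/Cy5 ratio with the treated sample in Cy3,
                   so it estimates treated/control.
    swapped_ratio: raw Cy3/Cy5 ratio with the labels reversed,
                   so it estimates control/treated.
    Returns the treated/control ratio averaged in log2 space
    (the log-space averaging is an assumption of this sketch).
    """
    log_forward = math.log2(forward_ratio)
    log_swapped = -math.log2(swapped_ratio)  # invert the swapped ratio
    return 2 ** ((log_forward + log_swapped) / 2)


# Toy example: true ratio 4, with a gene-specific bias that doubles
# the Cy3 signal in both hybridizations.
forward = 4.0 * 2.0   # treated in Cy3: bias inflates the ratio to 8
swapped = 2.0 / 4.0   # control in Cy3: raw ratio is 0.5
print(dye_swap_average(forward, swapped))
```

In this toy case the multiplicative Cy3 bias enters the two log ratios with opposite signs and cancels, which is the rationale the text gives for reversing the fluors.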

Treatment with FK506 in these growth conditions resulted in 
a signature pattern of altered gene expression in which mRNA 
levels of 36 ORFs changed by more than twofold 
(http://www.rosetta.org). A very similar pattern of altered gene 
expression was observed when the calcineurin mutant strain 
was compared to wild-type cells. Comparison of the changes in 
mRNA expression of each gene resulting from treatment of 
wild-type cells with FK506 with mRNA expression changes re- 
sulting from deletion of the calcineurin genes showed the con- 
siderable similarity of the global transcript alterations in 
response to the two perturbations (Fig. 2b-d). Quantification of
this similarity using the correlation coefficient (p) showed 
large correlations between the FK506 treatment signature and 
the calcineurin deletion signature (p = 0.75 ± 0.03), as well as 
the CsA treatment signature (p = 0.94±0.02), but not with a 
randomly selected deletion mutant strain (deleted for the 
YER071C gene; p = -0.07 ± 0.04; Fig. 2e). The FK506 treatment
signature was also compared with those of more than 40 other 
deletion mutant strains or drug-treatments thought to affect 
unrelated pathways, and none had statistically significant correlations.
These data establish that genetic disruption of calcineurin
function provides a close and specific phenocopy of
treatment with FK506 or CsA.
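
As an illustration of the signature comparison, the sketch below computes a Pearson correlation coefficient between two vectors of log10 expression ratios, one value per ORF. The text does not specify which ORFs enter the calculation or how the quoted uncertainties were derived, so this is only an assumed form of the metric; the function name and the five-ORF example values are hypothetical.

```python
import math


def signature_correlation(x, y):
    """Pearson correlation between two equal-length signature vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("signatures must have the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a signature with no variation has no correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical log10 ratios for five ORFs.
drug_signature = [0.5, -0.4, 0.0, 0.3, -0.2]
mutant_signature = [0.45, -0.35, 0.05, 0.25, -0.1]
similar = signature_correlation(drug_signature, mutant_signature)
opposite = signature_correlation(drug_signature, [-v for v in drug_signature])
```

A value near 1 corresponds to the close phenocopy reported for the calcineurin mutant, and a value near 0 to an unrelated deletion such as the YER071C control.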

To avoid generalizing from a single example, we also com- 
pared the effects of treatment of wild-type cells with 3-aminotri- 
azole (3-AT) with the effects of deletion of the HIS3 gene. HIS3 
encodes imidazoleglycerol phosphate dehydratase, which cat- 
alyzes the seventh step of the histidine biosynthetic pathway in 
yeast 22 ; 3-AT is a competitive inhibitor of this enzyme that trig- 
gers a large transcriptional amino-acid starvation response 23 . 
Microarray analysis of wild-type and isogenic his3-deficient 
strains demonstrated the expected large genome-wide transcrip- 
tional responses (involving more than 1,000 ORFs) resulting 
from treatment with 3-AT (Fig. 3a) or from HIS3 deletion (Fig. 
3c). Quantitative comparison of the 3-AT treatment signature 
and the his3 mutant signature showed a high level of correlation 
(p= 0.76 ± 0.02) that even extended to genes that experienced 
small changes in expression level (Fig. 3d). As a negative control 
the correlations between the 3-AT treatment signature or the 
his3 mutant signature and the calcineurin mutant strain were
not statistically significant (p = 0.09 ± 0.06 and -0.01 ± 0.04, re-
spectively). That both the calcineurin/FK506 and the his3/3-AT
comparisons were highly correlated indicates that in many cases 
the expression profile resulting from a gene deletion closely re- 
sembles the expression profile of wild-type cells treated with an 
inhibitor of that gene's product. 

'Decoder' strategy: Drug target validation with deletion mutants 

Because pharmacological inhibition of different targets might 
give similar or identical expression profiles, simple comparison 
of drug signatures to mutant signatures is unlikely to unambigu- 
ously identify a drug's target. To overcome this limitation, an 
additional 'decoder' step is used. We first compare the expres- 
sion profile of wild-type drug-treated cells to the expression pro- 
files from a panel of genetic mutant strains, using a correlation 
coefficient metric. Mutant strains whose expression profile is 
similar to that of drug-treated wild-type cells are selected and 
subjected to drug treatment, generating the drug signature in 
the mutant strain (that is, the mutant drug signature). If the 
mutated gene encodes a protein involved in a pathway affected 
by the drug, we expect the drug signature in mutant cells to be 
different (or absent, for an ideal drug) from the drug signature 
seen in wild-type cells. 
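
The two-step logic of the 'decoder' can be sketched as follows. This is not the authors' implementation: the function names, the strain labels and both thresholds are hypothetical, and a drug signature with no variation at all (the "absent" case for an ideal drug) is treated here as having zero correlation.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector has no variation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def decode(wt_drug, mutant_sigs, mutant_drug_sigs, select=0.5, lost=0.3):
    """Step 1: keep mutants whose signature resembles the wild-type drug
    signature. Step 2: check whether the drug still elicits that signature
    when applied to each selected mutant. Thresholds are illustrative."""
    calls = {}
    for strain, mutant_sig in mutant_sigs.items():
        if pearson(wt_drug, mutant_sig) < select:
            continue  # mutant does not phenocopy the drug; not followed up
        r = pearson(wt_drug, mutant_drug_sigs[strain])
        calls[strain] = "signature lost" if r < lost else "signature retained"
    return calls


# Hypothetical log10 ratios for five ORFs.
wt_drug = [0.5, -0.4, 0.0, 0.3, -0.2]
mutant_sigs = {
    "geneA": [0.45, -0.35, 0.05, 0.25, -0.1],  # phenocopies the drug
    "geneB": [-0.1, 0.2, 0.3, -0.2, 0.1],      # unrelated profile
    "geneC": [0.4, -0.5, 0.1, 0.2, -0.3],      # phenocopies the drug
}
mutant_drug_sigs = {
    "geneA": [0.0, 0.0, 0.0, 0.0, 0.0],        # drug has no effect in mutant
    "geneC": [0.5, -0.4, 0.0, 0.3, -0.2],      # drug effect unchanged
}
calls = decode(wt_drug, mutant_sigs, mutant_drug_sigs)
```

In this toy example only geneA behaves as expected for a gene in the pathway affected by the drug: its deletion resembles drug treatment, and the drug signature disappears when the deletion strain is treated.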
Fig. 2 Expression profiles from FK506-treated wild-type (wt) cells and a
calcineurin-disruption mutant strain share a genome-wide correlation. DNA
microarray analysis showing changes in gene expression resulting from FK506
treatment (a and b) or from genetic disruption of genes encoding
calcineurin (c). a, Pseudo-color image of the results of simultaneous
hybridization of Cy5-labeled cDNA (red) from mock-treated strain R563 and
Cy3-labeled cDNA (green) from strain R563 treated with 1 µg/ml FK506.
b, Enlarged view of the boxed area in a. Arrowheads indicate specific ORFs
induced or repressed. c, Pseudo-color image of the results of simultaneous
hybridization of Cy5-labeled cDNA (red) from strain R563 and Cy3-labeled
cDNA (green) from strain MCY300 (deleted for the CNA1 and CNA2 catalytic
subunits of calcineurin). Arrows indicate specific ORFs induced or
repressed. d, The log10 of the expression ratio for each ORF derived from
the FK506 treatment hybridizations is plotted versus the log10 of the
expression ratio in the calcineurin mutant hybridizations. ORFs that were
induced or repressed in both experiments are shown as green and red dots,
respectively. e, The log10 of the expression ratio for each ORF derived
from the FK506 treatment hybridizations is plotted versus the log10 of the
expression ratio in the yer071c mutant hybridizations. No ORFs were induced
or repressed in both experiments.



To illustrate this, we treated the his3 mutant strain with 3- 
AT. The signature pattern of altered gene expression resulting 
from treatment of the mutant strain with 3-AT was much less 
complex than that of the 3-AT signature in wild-type cells (Fig. 
4). This is seen simply by examining plots of mean intensity of 
the hybridization signal (which approximately reflects level of 
expression) versus the expression ratio for each ORF (Fig. 4). 
Genes that were expressed at higher or lower levels in 3-AT 
treated cells or in his3 mutant cells are shown as red and green 
dots, respectively. We analyzed the 3-AT signature in wild-type 
(Fig. 4a) and his3 mutant cells (Fig. 4c), as well as the his3 mu- 
tant strain signature (Fig. 4b). Whereas histidine limitation in- 
duced by 3-AT induced more than 1,000 transcription-level 
changes in the wild-type strain, few or no transcript level 
changes were induced by treatment of the his3-deletion strain
with 3-AT. This indicates that with the growth conditions used, 
essentially all of the effects of 3-AT depend on or are mediated 
through the HIS3 gene product. 

Applying this approach to the calcineurin signaling pathway 
showed the specificity of the method. The calcineurin mutant 
strain and strains with deletions in the genes encoding the 
most abundant immunophilins in yeast 12 (CPH1 and FPR1) 
were treated with either FK506 or CsA to determine the profiles 



Table 1 Signature correlation of expression ratios as a result of FK506
treatment in various mutant strains

Strain:                wild-type      cna            fpr1           cna fpr1       cph1
                       +/- FK506      +/- FK506      +/- FK506      +/- FK506      +/- FK506
wild-type +/- FK506    0.93 ± 0.04    -0.01 ± 0.07   -0.23 ± 0.07   0.12 ± 0.07    0.79 ± 0.03

Signature correlation shows the absence of the FK506 signature specifically in the calcineurin (cna) and fpr1
(major FK506 binding protein) deletion mutants. cna represents the mutant with deletions of the catalytic
subunits of calcineurin, CNA1 and CNA2. The correlation coefficient reported in the first column represents the
correlation between two pairs of hybridizations from independent wild-type +/- FK506 experiments.



of altered gene expression resulting from drug treatment of the 
mutant cells (that is, mutant +/- drug). We compared the drug 
signatures in the mutants to the wild-type drug signature using 
the correlation coefficient metric (Table 1). Although the signa- 
ture generated by treatment of wild-type cells with FK506 was 
highly correlated to the calcineurin mutant strain signature (p 
= 0.75 ± 0.03), it bore no similarity to the profile after treat- 
ment of the calcineurin mutant strain with FK506 (p = -0.01 ± 
0.07). This indicates that FK506 was unable to elicit its normal 
transcriptional response in the calcineurin mutant strain. 
Likewise, treatment of the fpr1 mutant strain with FK506
elicited an expression profile that was not correlated to the
FK506 signature in the wild-type strain (p = -0.23 ± 0.07),
indicating that the FPR1 gene product is likely to be involved in the
pathway affected by FK506. The same was true for the cna fpr1
mutant strain. In contrast, treatment of the cph1 mutant strain
with FK506 generated an expression profile highly correlated
with the wild-type FK506 expression profile (p = 0.79 ± 0.03),
indicating that the cph1 mutation did not block the mode of action
of FK506 and thus that CPH1 is not directly involved in the pathway
affected by FK506. We tabulated the change in expression in
response to FK506 in different mutant strains for all ORFs with
expression ratios greater than 1.8 in FK506-treated cells or in
the calcineurin mutant strain (Fig. 5a). The
calcineurin mutant strain signature and the
FK506 responses in wild-type and the cph1
mutant strain are similar, and there are no
transcript-level changes (seen in black) for
treatment of the calcineurin, fpr1 and cna
fpr1 mutant strains with FK506 (Fig. 5a).

Similar experiments and analyses with CsA 
provided further validation of this approach. 
The expression profile elicited by treatment 
of wild-type cells with CsA was highly corre- 
Fig. 3 Expression profiles from a his3 mutant strain and wild-type (wt)
cells treated with 3-AT share a genome-wide correlation. DNA microarray
analysis showing changes in gene expression resulting from 3-AT treatment
(a) or from genetic disruption of the HIS3 gene (c). a, Pseudo-color image
of the results of simultaneous hybridization of Cy5-labeled cDNA (red) from
mock-treated wild-type strain R491 and Cy3-labeled cDNA (green) from strain
R491 treated with 10 mM 3-AT. b, The log10 of the expression ratio for each
ORF derived from the 3-AT treatment hybridizations is plotted versus the
log10 of the expression ratio in the his3 mutant hybridizations. ORFs that
were induced or repressed in both experiments are shown as green and red
dots, respectively. The correlation of expression ratios applies not only
to genes with large expression ratios (for example, CHA1 and ARG1), but
also extends to genes with expression ratios less than 2 (for example, ILV1
and CPH1). ILV1 is induced 1.9-fold and 1.5-fold, and CPH1 is downregulated
1.9-fold and 1.7-fold, in cells treated with 3-AT and his3 mutant cells,
respectively. Two ORFs do not fall on the line x = y. The leftmost point is
the HIS3 data point, which is induced by 3-AT treatment but which is not
absent from the his3 mutant strain. The other point is YOR203w. Both data
points are labeled HIS3 because hybridization to YOR203w is most likely due
to HIS3 mRNA, as YOR203w overlaps the HIS3 open reading frame. c,
Pseudo-color image of the results of simultaneous hybridization of
Cy5-labeled cDNA (red) from wild-type strain R491 and Cy3-labeled cDNA
(green) from strain R1226, deleted for the HIS3 gene. Arrowheads indicate
specific ORFs induced or repressed.



lated to the profile elicited by mutation of the calcineurin genes
(p = 0.71 ± 0.04), but did not correlate with the expression pro-
file resulting from treatment of the calcineurin mutant strain
with CsA (p = -0.05 ± 0.07; Table 2), indicating that the genetic
deletion of calcineurin interfered with the ability of CsA to
elicit its normal transcriptional response. Likewise, the CsA sig-
nature was essentially absent in CsA-treated cph1 mutant cells,
and the expression profile of CsA-treated cph1 mutant cells cor-
related poorly to that of CsA-treated wild-type cells (p = 0.18 ±
0.07). Thus, the CPH1 gene product was required for the CsA re-
sponse seen in wild-type cells. Conversely, treatment of fpr1
mutant cells with CsA resulted in an expression pattern very
similar to the profile of CsA-treated wild-type cells (p = 0.77 ±
0.03), indicating that FPR1 was not necessary for the CsA-medi-
ated effects. Analysis of individual ORFs affected by CsA and
their expression ratios over the entire set of experiments con-
firmed that CPH1 and the genes encoding calcineurin, but not



FPR1, are necessary for the wild-type CsA response (Fig. 5b). The
observation that the profiles resulting from FK506 or CsA drug
treatment are similar to that of the calcineurin deletion mutant
strain might allow the prediction that calcineurin was involved
in the pathway affected by these drugs. But because the expres-
sion profile of the fpr1 mutant strain did not bear a strong simi-
larity to the wild-type drug expression profile for FK506,
drug treatment of the mutant strains was nec-
essary to identify Fpr1, but not Cph1, as a potential FK506 drug
target. In the same way, the 'decoder' strategy was necessary to
identify Cph1, but not Fpr1, as a potential drug target for CsA.

'Decoder' approach can identify secondary drug effects 
For a drug that has a single biochemical target, the strategy out- 
lined above may be useful in target validation. In many cases, 
however, a compound may affect multiple pathways and elicit 
a very complex signature. 'Decoding' such a complex signature
Fig. 4 Treatment of the his3 mutant strain with 3-AT shows nearly complete
loss of 3-AT signature. A plot of the log10 of the mean intensity of
hybridization for each ORF versus the log10 of its expression ratio for
each experiment is shown next to a pseudo-color image of a representative
portion of the microarray. ORFs that are induced or repressed at the 95%
confidence level are shown in green and red, respectively. a, Expression
profile from treatment of the wild-type (wt) strain with 3-AT. Cy5-labeled
cDNA (red) from mock-treated strain R491 and Cy3-labeled cDNA (green) from
strain R491 treated with 10 mM 3-AT. b, Expression profile from the his3
deletion strain. Cy5-labeled cDNA (red) from strain R491 and Cy3-labeled
cDNA (green) from strain R1226, deleted for the HIS3 gene. c, Expression
profile of treatment of the his3 deletion strain with 3-AT. Cy3-labeled
cDNA (red) from his3-deleted strain R1226 and Cy5-labeled cDNA (green) from
strain R1226 treated with 10 mM 3-AT. Arrowheads indicate the DNA probe and
data point corresponding to the HIS3 gene. The blue dashed line represents
the threshold below which errors tend to increase rapidly because spot
intensities are not sufficiently above background intensity.
Table 2 Signature correlation of expression ratios as a result of CsA
treatment in various mutant strains

Strain:              wild-type      cna            fpr1           cna cph1       cph1
                     +/- CsA        +/- CsA        +/- CsA        +/- CsA        +/- CsA
wild-type +/- CsA    0.94 ± 0.04    -0.05 ± 0.07   0.77 ± 0.03    -0.11 ± 0.07   0.18 ± 0.07

Signature correlation shows the absence of the CsA signature specifically in the calcineurin (cna) and cph1
(cyclophilin) deletion mutants. cna represents the mutant with deletions of the catalytic subunits of
calcineurin, CNA1 and CNA2. The correlation coefficient reported in the first column represents the correlation
between two pairs of hybridizations from independent wild-type +/- CsA experiments.



into the effects mediated through the intended target (the
'on-target' signature) and those mediated through unintended
targets (the 'off-target' signature) might be useful in evaluating a
compound's specificity. Our 'decoder' strategy is based on the
premise that the 'off-target' signature should be insensitive to the
genetic disruption of the primary target.

To determine whether the 'decoder' approach could identify
an 'off-target' profile, we looked for a drug-responsive gene 
whose expression is insensitive to deletion of the primary tar- 
get. To increase the likelihood of observing such genes, the 
same strains described in Tables 1 and 2 were treated with 
higher concentrations (50 µg/ml) of FK506. This led to a much
more complex expression profile in wild-type cells, indicating
that at this higher concentration, FK506 was inhibiting or
activating additional targets. Several of the ORFs in this expanded
FK506-induced expression profile were not affected by the
calcineurin, cph1 or fpr1 mutations, as drug treatment of these
mutant strains did not block their presence in the FK506
expression signature (Fig. 6). This indicates that FK506 was
triggering changes in transcript levels of many genes through
pathways independent of calcineurin, CPH1 and FPR1. Many of the
upregulated ORFs in the 'off-target' pathway were genes
reported to be regulated by the transcriptional activator Gcn4
(ref. 24). In some strains, a reporter gene under GCN4 control
was induced in response to FK506 treatment 25 . To determine
whether GCN4 is involved in this pathway that is independent
of calcineurin, CPH1 and FPR1, we analyzed the effects of
treatment with high-dose FK506 on global gene expression in a
strain with a GCN4 deletion (Fig. 6). Of the 41 ORFs with
calcineurin-independent expression ratios greater than 4, 32 were
not induced in the gcn4 mutant, indicating that their induction
by FK506 was GCN4-dependent. Not all GCN4-regulated genes
were induced by FK506. This FK506-induced subset of GCN4-
regulated genes may be those most sensitive to subtle changes
in Gcn4 levels, or perhaps other regulatory circuits prevent
FK506 activation of some GCN4-regulated genes. Seven of the
remaining nine ORFs induced by FK506 were independent of

Fig. 5 Response of FK506 and CsA signature genes in strains with deletions
in different genes. Genes with expression ratios greater than a factor of 1.8 in
response to treatment with 1 µg/ml FK506 (a) or 50 µg/ml CsA (b) are listed
(left side) and their expression ratios in the indicated strain are shown on the
green (induction)-red (repression) color scale. a, Calcineurin (cna) mutant
and FK506 treatment signature genes are in the first two columns. Almost all
FK506 signature genes have expression ratios near unity in deletion strains
involved in pathways affected by FK506 (calcineurin, fpr1 and cna fpr1
mutants) but not in deletion strains in unrelated pathways (cph1). b, Calcineurin
(cna) mutant and CsA treatment signature genes are in the first two
columns. Almost all CsA signature genes have expression ratios near unity in
deletion strains involved in pathways affected by CsA (calcineurin, cph1 and
cna cph1 mutants) but not in deletion strains in unrelated pathways (fpr1).



both the calcineurin and GCN4 pathways. The 
simplest explanation is that FK506 inhibits or 
activates additional pathways. Members of this 
class include SNQ2 and PDR5, genes that en- 
code drug efflux pumps with structural homol- 
ogy to mammalian multiple drug resistance 
proteins 20 . FK506 may interact directly with 
Pdr5 to inhibit its function 27 . Our results indicate
that treatment with FK506 leads to
fourfold-to-sixfold induction of PDR5 mRNA levels.
YOR1, another gene that can confer drug resis- 
tance, is also induced threefold-to-fourfold by 
FK506. Thus, drug treatment of strains with mutations in the 
primary targets can prove useful in identifying effects mediated 
by secondary drug targets, including the nature and extent of 
newly discovered and previously unsuspected pathways af- 
fected by the drug. 
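
The sorting of high-dose FK506 signature genes by their behavior across deletion strains can be sketched as a simple rule set. The class names follow the figure legend for the high-dose experiment, but the response threshold, the strain keys and the example ratios are hypothetical; the authors' actual criteria are not given in this section.

```python
def responds(ratio, fold=2.0):
    """True if the ratio shows at least `fold` induction or repression."""
    return ratio >= fold or ratio <= 1.0 / fold


def classify_orf(ratios, fold=2.0):
    """Assign one ORF to a response class from its +/- FK506 expression
    ratio in each strain. Keys: 'wt', 'cna', 'fpr1', 'cph1', 'gcn4'."""
    hit = {strain: responds(value, fold) for strain, value in ratios.items()}
    if not hit["wt"]:
        return "not in signature"
    if all(hit.values()):
        return "CNA- and GCN4-independent"
    if hit["gcn4"] and not hit["cna"] and not hit["fpr1"]:
        return "CNA-dependent"
    if hit["cna"] and hit["fpr1"] and hit["cph1"] and not hit["gcn4"]:
        return "GCN4-dependent"
    return "complex behavior"


# Hypothetical expression ratios (drug-treated / mock-treated) per strain.
examples = {
    "orf_a": {"wt": 5.0, "cna": 1.1, "fpr1": 0.9, "cph1": 4.6, "gcn4": 5.2},
    "orf_b": {"wt": 6.0, "cna": 5.5, "fpr1": 6.3, "cph1": 5.8, "gcn4": 1.2},
    "orf_c": {"wt": 4.5, "cna": 4.4, "fpr1": 4.8, "cph1": 4.1, "gcn4": 4.3},
    "orf_d": {"wt": 4.2, "cna": 1.0, "fpr1": 4.0, "cph1": 4.1, "gcn4": 1.1},
}
classes = {name: classify_orf(r) for name, r in examples.items()}
```

An ORF such as the hypothetical orf_c, which responds in every strain tested, corresponds to the class that includes SNQ2 and PDR5 in the text.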

We describe here a method for drug target validation and the 
identification of secondary drug target effects that uses DNA mi- 
croarrays to survey the effects of drugs on global gene expres- 
sion patterns. We established that genetic and pharmacologic 
inhibition of gene function can result in extremely similar 
changes in gene expression. We also demonstrated that one can 
confirm a potential drug target by treating a deletion mutant 
defective in the gene encoding the putative target. Drug-medi- 
ated signatures from strains with mutations in pathways or 
processes directly or indirectly affected by the drug bore little or 
no similarity to the wild-type drug expression profile. In con-
trast, drug-mediated signatures from strains with mutations in
genes involved in pathways unrelated to the drug's action 
showed extensive similarity to the wild-type drug signature. By 
applying this approach to a drug that affects multiple pathways 
(FK506), we were able to decode a complex signature into com- 
ponent parts, including the identification of an 'off-target' sig- 
nature that was mediated through pathways independent of 
calcineurin or the Fpr1 immunophilin.

Discussion 

It is well-established that high-throughput biochemical screen- 
ing can identify potent inhibitory compounds against a given 
target. The 'decoder' approach described here complements 
this process by evaluating the equally important property of 
specificity: the tendency of a compound to inhibit pathways 
other than that of its intended target. The ability to observe 
such 'off-target' effects will likely be useful in several ways. 
Profiling compounds with known toxicities will allow the de- 
velopment of a database of expression changes associated with 
particular toxicities. Recognition of potential toxicities in the 
'off-target' signatures of otherwise promising compounds then 
may allow earlier identification of those likely to fail in clinical 
trials. Comparing the extent and peculiarities of 'off-target' sig-
natures of promising drug candidates could provide a new way
to group compounds by their effects on secondary pathways, 
even before those effects are understood. This may prove to be 
an alternative, potentially more effective, way to select com- 
pounds for animal and clinical trials. Some drugs are more ef- 
fective against a related protein than against the originally 
intended target. Sildenafil (Viagra™), for example, was initially 
developed as a phosphodiesterase inhibitor to control cardiac 
contractility, but was found to be highly specific for phospho- 
diesterase 5, an isozyme whose inhibition overcomes defects in 



Fig. 6 Response of FK506 signature genes in strains with deletions
in different genes. Genes with expression ratios greater than a factor
of 4 in at least one experiment are listed and their expression ratios in
the indicated strain are shown in the green (induction)-red (repression)
color scale. The genes have been divided into classes corresponding
to these expected behaviors: 'CNA-dependent' genes
respond to FK506 (50 µg/ml) except when either calcineurin genes or
FPR1 or both are deleted; 'GCN4-dependent' genes respond to FK506
except when GCN4 is deleted. These genes still respond to FK506
when calcineurin genes or FPR1 or CPH1 are deleted; that is, their
responses are not mediated by calcineurin, Cph1, or Fpr1. 'CNA- and
GCN4-independent' genes respond to FK506 in all deletion strains
tested. A 'complex behavior' class is provided for those genes that did
not match the model of FK506 response mediated through calcineurin
or Fpr1 or separately through Gcn4.



penile erection. It is possible that application of the 'de- 
coder' to other compounds may show that they too have a 
potent activity against a target distinct from their in- 
tended target. 

The ability to decode drug effects is dependent on the 
availability of functionally 'targetless' cells. In yeast, this 
is being achieved by systematically disrupting each yeast 
gene (Saccharomyces Deletion Consortium;
http://sequence-www.stanford.edu/group/yeast_deletion_project/deletion.html).
Efforts are underway to obtain
expression profiles from each deletion mutant strain.
Determining signatures resulting from inactivation of es- 
sential genes presents a unique problem, but it may be 
possible to do so by examining heterozygotes or by using a con- 
trollable promoter to reduce expression of the essential gene. 
Although it is already feasible to test several compounds in 
dozens of yeast strains, another challenge for the 'decoder' 
strategy will be the efficient selection of the mutants with dele- 
tions in genes most likely to encode the intended drug target. 
The signature correlation plots described are one metric that 
could be used as part of that selection process, but others need 
to be explored. Applying the 'decoder' to mammalian cells pre- 
sents additional challenges. It is considerably more difficult to 
isolate functionally 'targetless' cells. Strategies involving titrat-
able promoters, known specific inhibitors, anti-sense RNAs, ri-
bozymes, and methods of targeting specific proteins for
degradation are possible and should be tested. Another limita- 
tion is that not all cell types express the same set of genes and 
therefore 'off-target' effects may be different in different cell 
types. In addition, applying the 'decoder' to human cells will
also require technical improvements that allow expression pro- 
filing from a small number of cells. Even the broader question 
of whether the insensitivity of 'off-target' signatures to the dis- 
ruption of the main target is the exception or the rule can only 
be answered by the accumulation of more data. Barkai and 
Leibler, however, have argued in favor of robustness of biologi- 
cal networks, indicating that drug perturbations ('off-target' 
signatures) may be robust even when the system is subjected to 
another perturbation (such as a genetic disruption) (ref. 28). 
Many practical developments will be necessary if the 'decoder' 
concept is to be broadly applied. 

Expression arrays have been used mainly as an initial screen 
for genes induced in a particular tissue or process of interest by 
focusing on genes with large expression ratios. We have 
found, however, that effort to refine experimental protocols 
and repeat experiments increases the reliability of the data and 
permits new applications. For example, it provides a larger set 
Table 3 Yeast strains used

Strain    Relevant genotype                                                                                Reference
YPH499    Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1                                        (34)
R563      Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 his3::HIS3                             (this study)
R558      Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 fpr1::HIS3                             (this study)
R567      Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 cph1::HIS3                             (this study)
MCY300    Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 cna1Δ1::hisG cna2Δ1::HIS3              (21)
R132      Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 cna1Δ1::hisG cna2Δ1::HIS3 cph1::kanr   (this study)
R133      Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 cna1Δ1::hisG cna2Δ1::HIS3 fpr1::kanr   (this study)
R559      Mata ura3-52 lys2-801 ade2-101 trp1-Δ63 his3-Δ200 leu2-Δ1 his3::HIS3 gcn4::LEU2                  (this study)
BY4719    Mata trp1-Δ63 ura3-Δ0                                                                            (35)
BY4738    Mata trp1-Δ63 ura3-Δ0                                                                            (35)
R491      Mata/α BY4719 x BY4738                                                                           (this study)
BY4728    Mata his3-Δ200 trp1-Δ63 ura3-Δ0                                                                  (35)
BY4729    Mata his3-Δ200 trp1-Δ63 ura3-Δ0                                                                  (35)
R1226     Mata/α BY4728 x BY4729                                                                           (this study)




of genes at higher confidence levels that serve as a more
unique signature for a given protein perturbation. In addition,
it allows subtle signatures to be detected, when, for example, a
protein is only partially inhibited. This may enable clinical
monitoring of small changes in protein function in disease or
toxicity states before they could otherwise be detected.
Because the functions of many genes detected on transcript
arrays are known, these microarrays are powerful tools that
provide detailed information about a cell's physiology. For
example, changes in the flux through a metabolic pathway are
reflected in transcriptional changes in genes in the pathway 7 .
Furthermore, it may be possible to indirectly measure protein
activity levels from expression profiling data (S.F. et al.,
unpublished data). Thus, although the eventual development of
genomic methods allowing the direct measurement of all
cellular protein levels will be an important achievement,
transcript array technology offers an immediate and robust means
of evaluating the effects of various treatments on gene
expression and protein function.

Methods

Construction, growth and drug treatment of yeast strains. The strains 
used in this study (Table 3) were constructed by standard techniques 29 . 
To construct strain R559, strain R563 was transformed to Leu+ with
plasmid pM12 digested by SalI and MluI (provided by A. Hinnebusch and T.
Dever). Strains R132 and R133 were constructed by transforming the
bacterial kanamycin resistance cassette 30 flanked by genomic DNA from the
CPH1 and FPR1 loci, respectively, and selecting for G418-resistant
colonies. For experiments with FK506, cells were grown for three
generations to a density of 1 x 10^7 cells/ml in YAPD medium (YPD plus 0.004%
adenine) supplemented with 10 mM calcium chloride as described 31 .
Where indicated, FK506 was added to a final concentration of 1 µg/ml
0.5 h after inoculation of the culture or to 50 µg/ml 1 h before cells were
collected. CsA was used at a final concentration of 50 µg/ml. Cells were
broken by standard procedures 32 with the following modifications: cell
pellets were resuspended in breaking buffer (0.2 M Tris-HCl pH 7.6, 0.5 M
NaCl, 10 mM EDTA, 1% SDS), vortexed for 2 min on a VWR multi-tube
vortexer at setting 8 in the presence of 60% glass beads (425-600 µm
mesh; Sigma) and phenol:chloroform (50:50, volume/volume). After
separation of the phases, the aqueous phase was re-extracted and
ethanol-precipitated. PolyA+ RNA was isolated by two sequential
chromatographic purifications over oligo dT cellulose (New England 
Biolabs, Beverly, Massachusetts) using established protocols". 

For experiments using 3-AT, wild-type or his3/his3 cells were grown to
early logarithmic phase in SC medium, pelleted and resuspended in SC
medium lacking histidine for 1 h in the presence or absence of 10 mM
3-AT, as indicated. Cells were harvested and mRNA isolated as above.
FK506 was obtained from the Swedish Hospital Pharmacy (Seattle, 
Washington) and purified to homogeneity by ethyl acetate extraction by 
J. Simon (Fred Hutchinson Cancer Research Center, Seattle, Washington). 
CsA was obtained from Alexis Biochemicals (San Diego, California); 3-AT 
was from Sigma. 

Preparation and hybridization of the labeled sample. Fluorescently-la- 
beled cDNA was prepared, purified and hybridized essentially as de- 
scribed 7 . Cy3- or Cy5-dUTP (Amersham) was incorporated into cDNA 
during reverse transcription (Superscript II; Life Technologies) and
purified by concentrating to less than 10 µl using Microcon-30
microconcentrators (Amicon, Houston, Texas). Paired cDNAs were resuspended in
20-26 µl hybridization solution (3x SSC, 0.75 ng/ml polyA DNA, 0.2%
SDS) and applied to the microarray under a 22- x 30-mm coverslip for 6
h at 63 °C, all according to a published method 7 .

Fabrication and scanning of microarrays. PCR products containing 
common 5' and 3' sequences (Research Genetics, Huntsville, Alabama) 
were used as templates with amino-modified forward primer and unmod- 
ified reverse primers to PCR amplify 6,065 ORFs from the S. cerevisiae 
genome. Our first-pass success rate was 94%. Amplification reactions that
gave products of unexpected sizes were excluded from subsequent
analysis. ORFs that could not be amplified from purchased templates were
amplified from genomic DNA. DNA samples from 100-µl reactions were
isopropanol-precipitated, resuspended in water, brought to a final
concentration of 3x SSC in a total volume of 15 µl, and transferred to
384-well microtiter plates (Genetix Limited, Christchurch, Dorset, England).
PCR products were spotted onto 1 x 3-inch polylysine-treated glass slides
by a robot built essentially according to defined specifications 357 
(http://cmgm.stanford.edu/pbrown/MGuide). After being printed, slides 
were processed according to published protocols 7 . 

Microarrays were imaged on a prototype multi-frame CCD camera in
development at Applied Precision (Issaquah, Washington). Each CCD
image frame was approximately 2-mm square. Exposure times of 2 s in
the Cy5 channel (white light through Chroma 618-648 nm excitation filter,
Chroma 657-727 nm emission filter) and 1 s in the Cy3 channel
(Chroma 535-560 nm excitation filter, Chroma 570-620 nm emission filter)
were done consecutively in each frame before moving to the next,
spatially contiguous frame. Color isolation between the Cy3 and Cy5
channels was about 100:1 or better. Frames were 'knitted' together in
software to make the complete images. The intensities of spots (about 100
µm) were quantified from the 10-µm pixels by frame-by-frame background
subtraction and intensity averaging in each channel. Dynamic
range of the resulting spot intensities was typically a ratio of 1,000
between the brightest spots and the background-subtracted additive error
level. Normalization between the channels was accomplished by normalizing
each channel to the mean intensities of all genes. This procedure is
nearly equivalent to normalization between channels using the intensity
ratio of genomic DNA spots 7, but is possibly more robust, as it is based on
the intensities of several thousand spots distributed over the array.
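
The channel normalization described above can be illustrated with a minimal sketch. It assumes background-subtracted spot intensities are already available as plain lists; the function names and the example intensities are hypothetical and are not taken from the original analysis software.

```python
from statistics import mean

def normalize_channels(cy3, cy5):
    """Divide each channel by its own mean intensity over all spots."""
    m3, m5 = mean(cy3), mean(cy5)
    return [v / m3 for v in cy3], [v / m5 for v in cy5]

def expression_ratios(cy3, cy5):
    """Return Cy5/Cy3 ratios computed after mean normalization."""
    n3, n5 = normalize_channels(cy3, cy5)
    return [red / green for green, red in zip(n3, n5)]

# Hypothetical intensities: the Cy5 channel is uniformly twice as bright,
# which normalization should remove as a channel-wide scale factor.
cy3 = [100.0, 200.0, 300.0]
cy5 = [200.0, 400.0, 600.0]
ratios = expression_ratios(cy3, cy5)
```

In this toy case every ratio comes out at unity, because the only difference between the channels is a global scale factor.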

Signature correlation coefficients and their confidence limits. 
Correlation coefficients between the signature ORFs of various experi- 
ments were calculated using: 

$$\rho = \frac{\sum_k x_k \, y_k}{\left(\sum_k x_k^2 \; \sum_k y_k^2\right)^{1/2}}$$

where x_k is the log10 of the expression ratio for the kth gene in the x
signature, and y_k is the log10 of the expression ratio for the kth gene in
the y signature. The summation is over those genes that were either up- or
down-regulated in either experiment at the 95% confidence level. These
genes each had a less than 5% chance of being actually unregulated (having
expression ratios departing from unity due to measurement errors
alone). This confidence level was assigned based on an error model which
assigns a lognormal probability distribution to each gene's expression
ratio with characteristic width based on the observed scatter in its
repeated measurements (repeated arrays at the same nominal experimental
conditions) and on the individual array hybridization quality. This latter
dependence was derived from control experiments in which both Cy3
and Cy5 samples were derived from the same RNA sample. For large
numbers of repeated measurements the error reduces to the observed
scatter. For a single measurement the error is based on the array quality
and the spot intensity.
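
As a rough illustration of the signature correlation, the sketch below computes an uncentered correlation of log10 expression ratios over a pre-selected gene set. The function name, the `significant` flags, and the example ratios are hypothetical; the error model that assigns the 95% confidence level is not reproduced here and is simply assumed to have been applied upstream.

```python
import math

def signature_correlation(x_ratios, y_ratios, significant):
    """Uncentered correlation of log10 ratios over the selected genes.

    `significant` holds one boolean per gene, assumed to mark genes that
    were up- or down-regulated in either experiment (chosen upstream).
    """
    xs, ys = [], []
    for x, y, keep in zip(x_ratios, y_ratios, significant):
        if keep:
            xs.append(math.log10(x))
            ys.append(math.log10(y))
    numerator = sum(a * b for a, b in zip(xs, ys))
    denominator = math.sqrt(sum(a * a for a in xs) * sum(b * b for b in ys))
    if denominator == 0.0:
        raise ValueError("no variation among the selected genes")
    return numerator / denominator

# Hypothetical expression ratios for four genes in two signatures.
x = [2.0, 0.5, 4.0, 1.0]
inverse_of_x = [0.5, 2.0, 0.25, 1.0]
flags = [True, True, True, False]
rho_self = signature_correlation(x, x, flags)
rho_opposite = signature_correlation(x, inverse_of_x, flags)
```

A signature compared with itself gives a correlation of unity, and a signature compared with its reciprocal ratios gives minus one, which matches the behaviour expected of the quantity defined above.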

Random measurement errors in the x and y signatures tend to bias the
correlation towards zero. In most experiments, most genes are not
significantly affected but do show small random measurement errors. Selecting
only the '95% confidence' genes for the correlation calculation, rather
than the entire genome, reduces this bias and makes the actual biological
correlations more apparent.

Correlations between a profile and itself are unity by definition. Error
limits on the correlation are 95% confidence limits based on the individual
measurement error bars, and assuming uncorrelated errors 33. They do
not include the bias mentioned above; thus, a departure of ρ from unity
does not necessarily mean that the underlying biological correlation is
imperfect. However, a correlation of 0.7 ± 0.1, for example, is very
significantly different from zero. Small (magnitude of ρ < 0.2) but formally
significant correlations in the tables and text probably are due to small
systematic biases in the Cy5/Cy3 ratios that violate the assumption of
independent measurement errors used to generate the 95% confidence
limits. Therefore, these small correlation values should be treated as not
significant. A likely source of uncorrected systematic bias is the partially
corrected scanner detector nonlinearity that differently affects the Cy3
and Cy5 detection channels.

The 1 ng/ml FK506 treatment signature was compared with more
than 40 unrelated deletion mutant strain or drug signatures. These control
profiles had correlation coefficients with the FK506 profile that were
distributed around zero (mean ρ = -0.03) with a standard deviation of
0.16 (data not shown), and none had correlations greater than ρ = 0.38.
Similarly, the calcineurin mutant strain signature correlated well with the
CsA treatment signature (ρ = 0.71 ± 0.04) but not with the signatures
from the negative controls (mean ρ = -0.02 with a standard deviation of
0.18).

Quality controls. End-to-end checks on expression ratio measurement 
accuracy were provided by analyzing the variance in repeated hybridizations
using the same mRNA labeled with both Cy3 and Cy5, and also
using Cy3 and Cy5 mRNA samples isolated from independent cultures of 
the same nominal strain and conditions. Biases undetected with this pro- 
cedure, such as gene-specific biases presumably due to differential incor- 
poration of Cy3- and Cy5-dUTP into cDNA, were minimized by doing 
hybridizations in fluor-reversed pairs, in which the Cy3/Cy5 labeling of 
the biological conditions was reversed in one experiment with respect to 
the other. The expression ratio for each gene is then the ratio of ratios be- 
tween the two experiments in the pair. Other biases are removed by algo- 
rithmic numerical de-trending. The magnitude of these biases in the 
absence of de-trending and fluor reversal is typically about 30% in the 
ratio, but may be as high as twofold for some ORFs. 
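
The effect of a fluor-reversed pair can be sketched numerically. The example assumes a multiplicative, gene-specific dye bias that multiplies the measured Cy5/Cy3 ratio in both hybridizations; taking the square root of the ratio of ratios to return to the scale of a single expression ratio is an assumption of this sketch, not a step stated in the text. All names and numbers are hypothetical.

```python
import math

def combine_fluor_reversed(ratio_forward, ratio_reversed):
    """Combine a fluor-reversed pair into one expression ratio.

    ratio_forward:  Cy5/Cy3 with the treated sample labeled with Cy5.
    ratio_reversed: Cy5/Cy3 with the treated sample labeled with Cy3.
    A multiplicative dye bias appears in both ratios and cancels in
    their quotient; the square root is an assumed rescaling step.
    """
    return math.sqrt(ratio_forward / ratio_reversed)

# Hypothetical gene: true treated/control ratio of 4 and a 30% dye bias.
true_ratio = 4.0
dye_bias = 1.3
forward = dye_bias * true_ratio
reverse = dye_bias / true_ratio
combined = combine_fluor_reversed(forward, reverse)
```

Under these assumptions the combined value recovers the true ratio of 4 even though each single measurement is off by 30%.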

Expression ratios are based on mean intensities over each spot. Some
smaller spots have fewer image pixels in the average. This does not
degrade accuracy noticeably until the number of pixels falls below ten, in
which case the spot is rejected from the data set. 'Wander' of spot posi- 
tions with respect to the nominal grid is adaptively tracked in array sub- 
regions by the image processing software. Unequal spot 'wander' within 
a subregion greater than half-a-spot spacing is a difficulty for the auto- 
mated quantitating algorithms; in this case, the spot is rejected from 
analysis based on human inspection of the 'wander'. Any spots partially 
overlapping are excluded from the data set. Fewer than 1% of spots
typically are rejected for these reasons.

The temporal program of gene expression during a model physiological
response of human cells, the response of fibroblasts to serum, was explored with
a complementary DNA microarray representing about 8600 different human 
genes. Genes could be clustered into groups on the basis of their temporal 
patterns of expression in this program. Many features of the transcriptional 
program appeared to be related to the physiology of wound repair, suggesting 
that fibroblasts play a larger and richer role in this complex multicellular 
response than had previously been appreciated. 



The response of mammalian fibroblasts to
serum has been used as a model for studying
growth control and cell cycle progression (1).
Normal human fibroblasts require growth
factors for proliferation in culture; these
growth factors are usually provided by fetal
bovine serum (FBS). In the absence of
growth factors, fibroblasts enter a nondividing
state, termed G0, characterized by low
metabolic activity. Addition of FBS or purified
growth factors induces proliferation of
the fibroblasts; the changes in gene expression
that accompany this proliferative response
have been the subject of many studies,
and the responses of dozens of genes to serum
have been characterized.

We took a fresh look at the response of 
human fibroblasts to serum, using cDNA mi- 
croarrays representing about 8600 distinct hu- 
man genes to observe the temporal program of 
transcription that underlies this response. Pri- 
mary cultured fibroblasts from human neonatal 
foreskin were induced to enter a quiescent state 
by serum deprivation for 48 hours and then 
stimulated by addition of medium containing 
10% FBS (2). DNA microarray hybridization 
was used to measure the temporal changes in 
mRNA levels of 8613 human genes (3) at 12
times, ranging from 15 min to 24 hours after 
serum stimulation. The cDNA made from pu- 
rified mRNA from each sample was labeled 
with the fluorescent dye Cy5 and mixed with a 
common reference probe consisting of cDNA 
made from purified mRNA from the quiescent
culture (time zero) labeled with a second fluorescent
dye, Cy3 (4). The color images of the
hybridization results (Fig. 1) were made by
representing the Cy3 fluorescent image as
green and the Cy5 fluorescent image as red and
merging the two color images.

Fig. 1. The same section of the microarray is shown for three independent
hybridizations comparing RNA isolated at the 8-hour time point after serum
treatment to RNA from serum-deprived cells. Each microarray contained 9996
elements, including 9804 human cDNAs, representing 8613 different genes.
mRNA from serum-deprived cells was used to prepare cDNA labeled with
Cy3-deoxyuridine triphosphate (dUTP), and mRNA harvested from cells at different times after serum
stimulation was used to prepare cDNA labeled with Cy5-dUTP. The two cDNA probes were mixed and
simultaneously hybridized to the microarray. The image of the subsequent scan shows genes whose
mRNAs are more abundant in the serum-deprived fibroblasts (that is, suppressed by serum treatment)
as green spots and genes whose mRNAs are more abundant in the serum-treated fibroblasts as red
spots. Yellow spots represent genes whose expression does not vary substantially between the two
samples. The arrows indicate the spots representing the following genes: 1, protein disulfide
isomerase-related protein P5; 2, IL-8 precursor; 3, EST AA057170; and 4, vascular endothelial growth factor.

Diverse temporal profiles of gene expression
could be seen among the 8613 genes surveyed
in this experiment (Fig. 2); many of these
genes (about half) were unnamed expressed
sequence tags (ESTs) (5). Although diverse
patterns of expression were observed, the orderly
choreography of the expression program became
apparent when the results were analyzed
by a clustering and display method developed
in our laboratory for analyzing genome-wide
gene expression data (6). An example of such
an analysis, here applied to a subset of 517
genes whose expression changed substantially
in response to serum (7), is shown in Fig. 2.
The entire detailed data set underlying Fig.
2 is available as a tab-delimited table (in
cluster order) at the Science Web site
(www.sciencemag.org/feature/data/984559.shl). In
addition, the entire, larger data set for the
complete set of genes analyzed in this experiment
can be found at a Web site maintained
by our laboratory (genome-www.stanford.edu/serum) (8).
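
The idea of grouping genes by the similarity of their temporal profiles can be sketched as follows. This is a simplified stand-in, not the published clustering procedure (6): it assumes Pearson correlation as the similarity measure and merges groups by their mean pairwise correlation. The gene names and log-ratio profiles are hypothetical.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    den = math.sqrt(sum(p * p for p in da) * sum(q * q for q in db))
    return sum(p * q for p, q in zip(da, db)) / den if den else 0.0

def cluster(profiles, n_clusters):
    """Agglomerative clustering using mean pairwise correlation."""
    groups = [[name] for name in profiles]

    def similarity(g1, g2):
        pairs = [(a, b) for a in g1 for b in g2]
        total = sum(pearson(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(groups) > n_clusters:
        # Find the two most similar groups and merge them (i < j always).
        i, j = max(combinations(range(len(groups)), 2),
                   key=lambda ij: similarity(groups[ij[0]], groups[ij[1]]))
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

# Hypothetical log expression ratios at four time points.
profiles = {
    "early_a": [0.0, 2.0, 1.0, 0.0],
    "late_a": [0.0, 0.0, 1.0, 2.0],
    "early_b": [0.0, 1.8, 0.9, 0.1],
    "late_b": [0.1, 0.0, 1.2, 1.9],
}
clusters = cluster(profiles, 2)
```

With these toy profiles the transiently induced genes and the late-induced genes fall into separate groups, which is the kind of structure the cluster image in Fig. 2 displays for the real data.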

One measure of the reliability of the 
changes we observed is inherent in the ex- 
pression profiles of the genes. For most genes 
whose expression levels changed, we could 
see a gradual change over a few time points, 
which thus effectively provided independent 
measurements for almost all of the observa- 
tions. An additional check was provided by 
the inclusion of duplicate and, in a few cases, 
multiple array elements representing the 
same gene for about 5% of the genes included 
in this microarray. In addition, three indepen- 
dent hybridizations to different microarrays 
with mRNA samples from cells harvested 8 
hours after serum addition showed good cor- 
relation (Fig. 1). As an independent test, we 
measured the expression levels of several 
genes using the TaqMan 5' nuclease fluori- 
genic quantitative polymerase chain reaction 
(PCR) assay (9). The expression profiles of 
the genes, as measured by these two indepen- 
dent methods, were very similar (Fig. 3) (10). 

The transcriptional response of fibroblasts
to serum was extremely rapid. The immediate
response to serum stimulation was dominated
by genes that encode transcription factors
and other proteins involved in signal transduction.
The mRNAs for several genes [including
c-FOS, JUN B, and mitogen-activated
protein (MAP) kinase phosphatase-1
(MKP1)] were detectably induced within
15 min after serum stimulation (Fig. 4, A
and B). Fifteen of the genes that were
observed to be induced by serum encode
known or suspected regulators of transcription
(Fig. 4B). All but one were immediate-early
genes: their induction was not inhibited
by cycloheximide (11). This class of
genes could be divided into those
whose induction was transient (Fig. 2, cluster E)
and those whose mRNA levels remained
induced for much longer (Fig. 2,
clusters I and J). Some features of the
immediate response appeared to be directed
at adaptation to the initiating signals. We
observed a marked induction of mRNA
encoding MKP1, a dual-specificity phosphatase
that modulates the activity of the
ERK1 and ERK2 MAP kinases (12). The
coincidence of the peak of expression of
genes in cluster E (Fig. 2) with that of



Fig. 2. Ouster image 
showing the different 
classes of gene expres- 
sion profiles. Five hun- 
dred seventeen genes 
whose mRNA levels 
changed in response to 
serum stimulation were 
selected (7). This sub- 
set of genes was clus- 
tered hierarchically into 
groups on the basis of 
the similarity of their 
expression profiles by 
the procedure of Eisen 
et ai (6). The expres- 
sion pattern of each 
gene in this set is dis- 
played here as a hori- 
zontal strip. For each 
gene, the ratio of 
mRNA levels in fibro- 
blasts at the indicat- 
ed time after serum 
stimulation ("unsync" 
denotes exponentially 
growing cells) to its 
level in the serum-de- 
prived (time zero) fi- 
broblasts is represented 
by a color, according to 
the color scale at the 
bottom. The graphs 
show the average ex- 
pression profiles for the 
genes in the corre- 
sponding "cluster" (in- 
dicated by the letters A 
to J and color coding). 
In every case examined, 
when a gene was rep- 
resented by more than 
one array element, the 
multiple representa- 
tions in this set were 
seen to have identical 
or very similar expres- 
sion profiles, and the 
profiles corresponding 
to these independent 
measurements clus- 
tered either adjacent 
or very dose to each 
other, pointing to the 
robustness of the clus- 
tering algorithm in 
grouping genes with 
very similar patterns of 
expression. 
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that continued activity of the MAP kinase path- 
way is required to maintain induction of these 
genes but not of those with sustained expression 
(clusters I and J). The gene encoding a second 
member of the dual-specificity MAP kinase 
phosphatase family, known as dual-specificity 
protein phosphatase 6/pyst2, was induced later, 
at about 4 hours after serum stimulation. Genes 
encoding diverse other proteins with roles in 
signal transduction, ranging from cell-surface 
receptors [for example, the sphingosine 1- 
phosphate receptor (EDG-1), the vascular en- 
dothelial growth factor receptor, and the type II 
BMP receptor] to regulators of G-protein sig- 
naling (for example, NETl/pl 15 rho GEF) to 
DNA-binding transcription factors, were in- 
duced by serum (Fig. 4A). 

The reprogramming of the regulatory cir- 
cuits in response to serum involved not only 
induction of transcription factors but also re- 
duced expression of many transcriptional reg- 
ulators — some of which may play roles in 
maintaining the cells in G0 or in priming
them to react to wounding (Fig. 4C). Perhaps 
as a consequence of the historical focus on 
genes induced by serum stimulation of fibro- 
blasts, the set of transcription factors whose 
expression diminished upon serum stimula- 
tion has been less well characterized. 

Genes known or likely to be involved in
controlling and mediating the proliferative response
showed distinctive patterns of regulation.
Several genes whose products inhibit progression
of the cell-division cycle, such as p27
Kip1, p57 Kip2, and p18, were expressed in the
quiescent fibroblasts and down-regulated before
the onset of cell division. The nadir in the
mRNA levels for these genes occurred between
6 and 12 hours after serum stimulation (Fig.
5A), coincident with the passage of the fibroblasts
through G1. The levels of the transcript
encoding the WEE1-like protein kinase, which
is believed to inhibit mitosis by phosphorylation
of Cdc2, diminished between 4 and 8 to
12 hours after serum addition (Fig. 5A), well
before the onset of M phase at around 16 hours,
raising the possibility of an additional role for
Wee1 in an earlier stage of the cell cycle or in
regulating the G0 to G1 transition. Several
genes induced in the first few hours after serum
stimulation, such as the helix-loop-helix proteins
ID2 and ID3 and EST AA016305, a gene
with homology to G1-S cyclins, are candidates
for roles in promoting the exit from G0.

Genes involved in mediating progression
through the cell cycle were characterized by a
distinctive pattern of expression (Fig. 2, cluster D),
reflecting the coincidence of their
expression with the reentry of the stimulated
fibroblasts into the cell-division cycle. The
stimulated fibroblasts replicated their DNA
about 16 hours after serum treatment. This
timing was reflected by the induction of
mRNA encoding both subunits of ribonucleotide
reductase and PCNA, the processivity
factor for DNA polymerase epsilon and delta.
Cyclin A, Cyclin B1, Cdc2, and CDC28 kinase,
regulators of passage through the S
phase and the transition from G2 to M phase,
were induced at about 16 to 20 hours after
serum addition. The kinase in the Cyclin
B1-CDK pair needs to be activated by phosphorylation.
The gene encoding Cyclin-dependent
kinase 7 (CDK7; a homolog of Xenopus
MO15 cdk-activating kinase) was induced
in parallel with the Cdc2 and Cdc28
kinases (Fig. 5A), suggesting a potential role
for CDK7 in mediating M phase. DNA topoisomerase
II α, required for chromosome segregation
at mitosis; Mad2, a component of
the spindle checkpoint that prevents completion
of mitosis (anaphase) if chromosomes
are not attached to the spindle; and the kinetochore
protein CENP-F all showed a similar
expression profile.

In the hours after the serum stimulus, one of 
the most striking features of the unfolding tran- 
scriptional program was the appearance of nu- 
merous genes with known roles in processes 
relevant to the physiology of wound healing. 



These included both genes involved in the di- 
rect role played by fibroblasts in remodeling of 
the clot and the extracellular matrix and, more 
notably, genes encoding proteins involved in 
intercellular signaling (Fig. 5). Genes induced 
in this program encode products that can (i) 
participate in the dynamic process of clotting, 
clot dissolution, and remodeling and perhaps 
contribute to hemostasis by promoting local 
vasoconstriction (for example, endothelin-1); 
(ii) promote chemotaxis and activation of neu- 
trophils (for example, COX2) and recruitment 
and extravasation of monocytes and macro- 
phages (for example, MCP1); (iii) promote 
chemotaxis and activation of T lymphocytes 
[for example, interleukin-8 (IL-8)] and B 
lymphocytes (for example, ICAM-1), thus 
providing both innate and antigen-specific 
defenses against wound infection and recruit- 
ing the phagocytic cells that will be required 
to clear out the debris during remodeling of 
the wound; (iv) promote angiogenesis and 
neovascularization (for example, VEGF) 
through newly forming tissue; (v) promote 
migration and proliferation of fibroblasts (for 
example, CTGF) and their differentiation into 
myofibroblasts (for example, Vimentin); and 
(vi) promote migration and proliferation of 
keratinocytes, leading to reepithelialization 
of the wound (for example, FGF7), and pro- 
mote proliferation of melanocytes, perhaps 
contributing to wound hyperpigmentation 
(for example, FGF2). 

Coordinated regulation of groups of genes
whose products act at different steps in a
common process was a recurring theme. For
example, Furin, a prohormone-processing
protease required for one of the processing
steps in the generation of active endothelin,
was induced in parallel with induction of the
gene encoding the precursor of endothelin-1
(Fig. 5E) (13). Conversely, expression of
CALLA/CD10, a membrane metalloprotease
that degrades endothelin-1 and other peptide
mediators of acute inflammation, was reduced.
A second example is provided by a set
of five genes involved in the biosynthesis of
cholesterol (Fig. 5I). The mRNAs encoding
each of these enzymes showed sharply diminished
expression beginning 4 to 6 hours after
serum stimulation of fibroblasts. A likely explanation
for the coordinated down-regulation
of the cholesterol biosynthetic pathway
is that serum provides cholesterol to fibroblasts
through low-density lipoproteins,
whereas in the absence of the cholesterol
provided by serum, endogenous cholesterol
biosynthesis in fibroblasts is required.

Fig. 3. Independent verification of microarray quantitation. Relative mRNA
levels of the indicated genes (Mast, mast/stem cell growth factor receptor)
were measured with the TaqMan 5' nuclease fluorigenic quantitative PCR
assay (9) (left) in the same samples that were used to prepare probes for
microarray hybridizations (right). Data from the TaqMan analysis were
normalized to mRNA concentrations and plotted relative to the level at
time zero, so that the results could be compared with those from the
microarray hybridizations. In general, quantitation with the two methods
gave very similar results (10).

Many of the previously studied genes that
we observed to be regulated in this program
have no recognized role in any aspect of wound
healing or fibroblast proliferation. Their identification
in this study may therefore point to
previously unknown aspects of these processes.
A few selected genes in this group are shown in
Fig. 5H. The stanniocalcin gene, for example
(Fig. 5H), encodes a secreted protein without a
clearly identified function in human cells (14,
15). Its induction in serum-stimulated fibroblasts
suggests the possibility that it may play a
role in the wound-healing process, perhaps
serving as a signal in mediating inflammation
or angiogenesis.

Fig. 4. "Reprogramming" of fibroblasts. Expression
profiles of genes whose function is likely to
play a role in the reprogramming phase of the
response are shown with the same representation
as in Fig. 2. In the cases in which a gene
was represented by more than one element in
the microarray, all measurements are shown.
The genes were grouped into categories on the
basis of our knowledge of their most likely role.
Some genes with pleiotropic roles were included
in more than one category.

One of the most important results of this
exploration was the discovery of over 200 previously
unknown genes whose expression was
regulated in specific temporal patterns during
the response of fibroblasts to serum. For example,
13 of the 40 genes in cluster D (Fig. 2) have
descriptive names that reflect their putative
function. Nine of these 13 genes (69%) encode
proteins that play roles in cell cycle progression,
particularly in DNA replication and the
G2-M transition. This enrichment for cell
cycle-related genes suggests that some of the
unnamed genes in this cluster (for example,
EST W79311 and EST R13146, neither of
which has sequence similarity to previously
characterized genes) may represent previously
unknown genes involved in this part of the cell
cycle. Similarly, a remarkable fraction of genes
that were grouped into cluster F on the basis of
their expression profiles encoded proteins involved
in intercellular signaling (Fig. 2), suggesting
that a similar role should be considered
for the many unnamed genes in this cluster. A
disproportionately large fraction of the genes
whose transcription diminished upon serum
stimulation were unnamed ESTs.

Fig. 5. The transcriptional response to serum suggests a multifaceted role for fibroblasts in the
physiology of wound healing. The features of the transcriptional program of fibroblasts in response
to serum stimulation that appear to be related to various aspects of the wound-healing process and
fibroblast proliferation are shown with the same convention for representing changes in transcript
levels as was used in Figs. 2 and 4. (A) Cell cycle and proliferation, (B) coagulation and hemostasis,
(C) inflammation, (D) angiogenesis, (E) tissue remodeling, (F) cytoskeletal reorganization, (G)
reepithelialization, (H) unidentified role in wound healing, and (I) cholesterol biosynthesis. The
numbers in (C) and (G) refer to genes whose products serve as signals to neutrophils (C1),
monocytes and macrophages (C2), T lymphocytes (C3), B lymphocytes (C4), and melanocytes (G1).

Our intention was to use this experiment as
a model to study the control of the transition
from G0 to a proliferating state. However, one
of the defining characteristics of genome-scale 
expression profiling experiments is that the ex- 
amination of so many diverse genes opens a 
window on all the processes that actually occur 
and not merely the single process one intended 
to observe. Serum, the soluble fraction of clot- 
ted blood, is normally encountered by cells in 
vivo in the context of a wound. Indeed, the 
expression program that we observed in re- 
sponse to serum suggests that fibroblasts are 
programmed to interpret the abrupt exposure to 
serum not as a general mitogenic stimulus but 
as a specific physiological signal, signifying a 
wound. The proliferative response that we orig- 
inally intended to study appeared to be part of a 
larger physiological response of fibroblasts to a 
wound. Other features of the transcriptional 
response to serum suggest that the fibroblast is 
an active participant in a conversation among 
the diverse cells that work together in wound 
repair, interpreting, amplifying, modifying, and 
broadcasting signals controlling inflammation, 
angiogenesis, and epithelial regrowth during 
the response to an injury. 

We recognize that these in vitro results 
almost certainly represent a distorted and in- 
complete rendering of the normal physiolog- 
ical response of a fibroblast to a wound. 
Moreover, only the responses elicited directly 
by exposure of fibroblasts to serum were 
examined. The subsequent signals from other 
cellular participants in the normal wound- 
healing process would certainly provoke fur- 
ther evolution of the transcriptional program 
in fibroblasts at the site of a wound, which 
this experiment cannot reveal. Nevertheless, 
we believe that the picture that emerged 
strongly suggests a much larger and richer 
role for the fibroblast in the orchestration of 
this important physiological process than had 
previously been suspected.

Systematic variation in gene expression
patterns in human cancer cell lines

Douglas T. Ross 1 , Uwe Scherf 5 , Michael B. Eisen 2 , Charles M. Perou 2 , Christian Rees 2 , Paul Spellman 2 , 
Vishwanath Iyer 1 , Stefanie S. Jeffrey 3 , Matt Van de Rijn 4 , Mark Waltham 5 , Alexander Pergamenschikov 2 , 
Jeffrey C.F. Lee 6 , Deval Lashkari 7 , Dari Shalon 6 , Timothy G. Myers 8 , John N. Weinstein 5 , David Botstein 2
& Patrick O. Brown 1,9

We used cDNA microarrays to explore the variation in expression of approximately 8,000 unique genes among the 
60 cell lines used in the National Cancer Institute's screen for anti-cancer drugs. Classification of the cell lines based 
solely on the observed patterns of gene expression revealed a correspondence to the ostensible origins of the 
tumours from which the cell lines were derived. The consistent relationship between the gene expression patterns 
and the tissue of origin allowed us to recognize outliers whose previous classification appeared incorrect. Specific 
features of the gene expression patterns appeared to be related to physiological properties of the cell lines, such 
as their doubling time in culture, drug metabolism or the interferon response. Comparison of gene expression pat- 
terns in the cell lines to those observed in normal breast tissue or in breast tumour specimens revealed features of 
the expression patterns in the tumours that had recognizable counterparts in specific cell lines, reflecting the 
tumour, stromal and inflammatory components of the tumour tissue. These results provided a novel molecular 
characterization of this important group of human cell lines and their relationships to tumours in vivo. 



Introduction 

Cell lines derived from human tumours have been extensively used 
as experimental models of neoplastic disease. Although such cell 
lines differ from both normal and cancerous tissue, the inaccessi- 
bility of human tumours and normal tissue makes it likely that 
such cell lines will continue to be used as experimental models for 
the foreseeable future. The National Cancer Institute's Develop- 
mental Therapeutics Program (DTP) has carried out intensive 
studies of 60 cancer cell lines (the NCI60) derived from tumours 
from a variety of tissues and organs 1 *^. The DTP has assessed many 
molecular features of the cells related to cancer and chemothera- 
peutic sensitivity, and has measured the sensitivities of these 60 cell 
lines to more than 70,000 different chemical compounds, includ- 
ing all common chemotherapeutics (http://dtp.nci.nih.gov). A 
previous analysis of these data revealed a connection between the 
pattern of activity of a drug and its method of action. In particular, 
there was a tendency for groups of drugs with similar patterns of 
activity to have related methods of action 3,5-7.

We used DNA microarrays to survey the variation in abun- 
dance of approximately 8,000 distinct human transcripts in these 
60 cell lines. Because of the logical connection between the func- 
tion of a gene and its pattern of expression, the correlation of gene 
expression patterns with the variation in the phenotype of the cell 
can begin the process by which the function of a gene can be 
inferred. Similarly, the patterns of expression of known genes can
reveal novel phenotypic aspects of the cells and tissues studied 8-10.
Here we present an analysis of the observed patterns of gene 
expression and their relationship to phenotypic properties of the 
60 cell lines. The accompanying report 11 explores the relationship
between the gene expression patterns and the drug sensitivity pro- 
files measured by the DTP. The assessment of gene expression pat- 
terns in a multitude of cell and tissue types, such as the diverse set 
of cell lines we studied here, under diverse conditions in vitro and 
in vivo, should lead to increasingly detailed maps of the human 
gene expression program and provide clues as to the physiological 
roles of uncharacterized genes 11-16. The databases, plus tools for
analysis and visualization of the data, are available
(http://genome-www.stanford.edu/nci60 and http://discover.nci.nih.gov).

Results 

We studied gene expression in the 60 cell lines using DNA 
microarrays prepared by robotically spotting 9,703 human 
cDNAs on glass microscope slides 17,18 . The cDNAs included 
approximately 8,000 different genes: approximately 3,700 repre- 
sented previously characterized human proteins, an additional 
1,900 had homologues in other organisms and the remaining 
2,400 were identified only by ESTs. Due to ambiguity of the iden- 
tity of the cDNA clones used in these studies, we estimated that 
approximately 80% of the genes in these experiments were correctly
identified. The identities of approximately 3,000 cDNAs from these
experiments have been sequence-verified, including all of those
referred to here by name.

Fig. 1 Gene expression patterns related to the tissue of origin of the cell lines.
Two-dimensional hierarchical clustering was applied to expression data from a set of
1,161 cDNAs measured across 64 cell lines. The 1,161 cDNAs were those (of 9,703 total)
with transcript levels that varied by at least sevenfold (log2(ratio) > 2.8) relative to
the reference pool in at least 4 of 60 cell lines. This effectively selected genes with
the greatest variation in expression level across the 60 cell lines (including those
genes not well represented in the reference pool), and therefore highlighted those gene
expression patterns that best distinguished the cell lines from one another. Data from
64 hybridizations were used, one for each cell line plus the two additional independent
representations of each of the cell lines K562 and MCF7. The two cell lines represented
in triplicate were correspondingly weighted for the gene clustering so that each of the
60 cell lines contributed equally to the clustering. a, The cell-line dendrogram, with
the terminal branches coloured to reflect the ostensible tissue of origin of the cell
line (red, leukaemia; green, colon; pink, breast; purple, prostate; light blue, lung;
orange, ovarian; yellow, renal; grey, CNS; brown, melanoma; black, unknown
(NCI/ADR-RES)). The scale to the right of the dendrogram depicts the correlation
coefficient represented by the length of the dendrogram branches connecting pairs of
nodes. Note that the two triplets of replicated cell lines (K562 and MCF7) cluster
tightly together and were well differentiated from even the most closely related cell
lines, indicating that this clustering of cell lines is based on characteristic
variations in their gene expression patterns rather than artefacts of the experimental
procedures. b, A coloured representation of the data table, with the rows (genes) and
columns (cell lines) in cluster order. The dendrogram representing hierarchical
relationships between genes was omitted for clarity, but is available
(http://genome-www.stanford.edu/nci60). The colour in each cell of this table reflects
the mean-adjusted expression level of the gene (row) and cell line (column). The colour
scale used to represent the expression ratios is shown. The labels 3a-3d in (b) refer to
the clusters of genes shown in detail in Fig. 3.
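
The selection rule given in the Fig. 1 legend (at least sevenfold variation relative to the reference pool in at least 4 of the 60 cell lines) can be sketched as follows. This is a minimal illustration, not the authors' software: the function names and the table layout are hypothetical, and counting both increases and decreases relative to the reference as "variation" is an assumption.

```python
import math

FOLD = 7.0      # "at least sevenfold"; log2(7) is about 2.8
MIN_LINES = 4   # "in at least 4 of 60 cell lines"


def varies_enough(log2_ratios, fold=FOLD, min_lines=MIN_LINES):
    """Return True if |log2 ratio| reaches log2(fold) in at least min_lines cell lines.

    Assumption: a sevenfold change in either direction counts. Missing
    measurements are passed as None and ignored.
    """
    threshold = math.log2(fold)
    hits = sum(1 for v in log2_ratios if v is not None and abs(v) >= threshold)
    return hits >= min_lines


def select_genes(table):
    """table maps a clone identifier to its list of log2 ratios, one per cell line."""
    return [gene for gene, row in table.items() if varies_enough(row)]
```

Applied to a table of 9,703 clones, a filter of this kind would return the subset used for clustering; the exact count depends on details of normalization that are not restated here.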

Each hybridization compared Cy5-labelled cDNA reverse tran- 
scribed from mRNA isolated from one of the cell lines with Cy3- 
labelled cDNA reverse transcribed from a reference mRNA 
sample. This reference sample, used in all hybridizations, was 
prepared by combining an equal mixture of mRNA from 12 of 
the cell lines (chosen to maximize diversity in gene expression as 
determined primarily from two-dimensional gel studies 2 ). By 
comparing cDNA from each cell line with a common reference, 
variation in gene expression across the 60 cell lines could be 
inferred from the observed variation in the normalized Cy5/Cy3 
ratios across the hybridizations. 
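
Because every cell line is compared with the same reference pool, the reference cancels when two cell lines are compared with each other. The sketch below illustrates this with log2 ratios. The median-centring step is an assumed normalization (the text says only that the ratios were normalized), and the function names are hypothetical.

```python
import math
from statistics import median


def normalized_log_ratios(cy5, cy3):
    """Return log2(Cy5/Cy3) per spot, shifted so that the median log ratio is 0.

    Median-centring is an assumption made for this sketch, not a method
    stated in the text.
    """
    raw = [math.log2(sample / reference) for sample, reference in zip(cy5, cy3)]
    centre = median(raw)
    return [value - centre for value in raw]


def relative_expression(log_ratio_a, log_ratio_b):
    """Fold difference between cell lines A and B for one gene.

    Both inputs are log2 ratios against the same reference, so
    (A/ref) / (B/ref) = A/B and the reference term drops out.
    """
    return 2 ** (log_ratio_a - log_ratio_b)
```

For example, a gene with a log2 ratio of 3 in one cell line and 1 in another is inferred to be fourfold more abundant in the first, whatever its level in the reference pool.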

To assess the contribution of artefactual sources of variation in 
the experimentally measured expression patterns, K562 and 
MCF7 cell lines were each grown in three independent cultures, 
and the entire process was carried out independently on mRNA 
extracted from each culture. The variance in the triplicate fluo- 
rescence ratio measurements approached a minimum when the 
fluorescence signal was greater than approximately 0.4% of the 
measurable total signal dynamic range above background in 
either channel of the hybridization. We selected the subset of 
spots for which significant signal was present in both the numer- 
ator and denominator of the ratios by this criterion to identify 
the best-measured spots. The pair-wise correlation coefficients 
for the triplicates of the set of genes that passed this quality con- 
trol level (6,992 spots included for the MCF7 samples and 6,161 
spots for K562) ranged from 0.83 to 0.92 (for graphs and details, 
see http://genome-www.stanford.edu/nci60). 
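
The quality-control step above can be sketched in two parts: keep spots whose signal exceeds about 0.4% of the dynamic range in both channels, then compute the pair-wise correlation between replicate hybridizations over those spots. The function names and the dynamic-range value are hypothetical, and the use of the Pearson coefficient is an assumption, since the text does not name the coefficient.

```python
import math
from itertools import combinations


def well_measured(cy5, cy3, dynamic_range, fraction=0.004):
    """Indices of spots whose background-subtracted signal exceeds the
    cut-off in BOTH channels (numerator and denominator of the ratio)."""
    cutoff = fraction * dynamic_range
    return [i for i, (s, r) in enumerate(zip(cy5, cy3)) if s > cutoff and r > cutoff]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(replicates):
    """Correlation for every pair of replicate log-ratio vectors.

    Three replicates give three pairs: (1,2), (1,3) and (2,3).
    """
    return [pearson(a, b) for a, b in combinations(replicates, 2)]
```

With three independent cultures per cell line, `pairwise_correlations` returns three values, which corresponds to the range of coefficients reported for each triplicate.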

To make the orderly features in the data more apparent, we used
a hierarchical clustering algorithm 19,20 and a pseudo-colour
visualization matrix 3,21. The object of the clustering was to group cell
lines with similar repertoires of expressed genes and to group
genes whose expression level varied among the 60 cell lines in a
similar manner. Clustering was performed twice using different
subsets of genes to assess the robustness of the analysis. In one case
(Fig. 1), we concentrated on those genes that showed the most
variation in expression among the 60 cell lines (1,161 total). A
second analysis (Fig. 2) included all spots that were thought to be well
measured in the reference set (6,831 spots).
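
The grouping step can be illustrated with a small agglomerative clustering routine. The sketch below assumes average linkage with the Pearson correlation as the similarity measure; the text does not state the linkage rule, and the function names and data layout are hypothetical. The same routine groups cell lines (profiles across genes) or genes (profiles across cell lines).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_linkage(profiles):
    """Cluster named expression profiles bottom-up.

    profiles maps a name to a list of values. Returns the merges in the
    order performed, each as (members_a, members_b, similarity), where
    similarity is the mean correlation over all cross-cluster pairs.
    """
    clusters = [(name,) for name in profiles]
    merges = []

    def similarity(a, b):
        pairs = [pearson(profiles[i], profiles[j]) for i in a for j in b]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = similarity(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

The similarity recorded at each merge plays the role of the correlation scale drawn beside a dendrogram: tightly related profiles merge at values near 1, and unrelated groups merge last at low or negative values.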

Gene expression patterns related to the histologic 
origins of the cell lines 

The most notable property of the clustered data was that cell lines 
with common presumptive tissues of origin grouped together 
(Figs 1a and 2). Cell lines derived from leukaemia, melanoma,
central nervous system, colon, renal and ovarian tissue were clus- 
tered into independent terminal branches specific to their respec- 
tive organ types with few exceptions. Cell lines derived from 
non-small lung carcinoma and breast tumours were distributed 
in multiple different terminal branches suggesting that their gene 
expression patterns were more heterogeneous. 

Many of these coherent cell line clusters were distinguished by
the specific expression of characteristic groups of genes
(Fig. 3a-d). For example, a cluster of approximately 90 genes was
highly expressed in the melanoma-derived lines (Fig. 3c). This set
was enriched for genes with known roles in melanocyte biology,
including tyrosinase and dopachrome tautomerase (TYR and
DCT; two subunits of an enzyme complex involved in melanin
synthesis 22), MART1 (MLANA; which is being investigated as a
target for immunotherapy of melanoma 23) and S100-β (S100B;
which has been used as an antigenic marker in the diagnosis of
melanoma). LOXIMVI, the seventh line designated as melanoma
in the NCI60, did not show this characteristic pattern. Although
isolated from a patient with melanoma, LOXIMVI has previously
been noted to lack melanin and other markers useful for
identification of melanoma cells 1.

Fig. 2 Gene expression patterns related to other cell-line phenotypes. a, We applied
two-dimensional hierarchical clustering to expression data from a set of 6,831 cDNAs
measured across the 64 cell lines. The 6,831 cDNAs were those with a minimum
fluorescence signal intensity of approximately 0.4% of the dynamic range above
background in the reference channel in each of the six hybridizations used to establish
reproducibility. This effectively selected those spots that provided the most reliable
ratio measurements and therefore identified a subset of genes useful for exploring
patterns comprised of those whose variation in expression across the 60 cell lines was
of moderate magnitude. b, Cluster-ordered data table. c, Doubling time of cell lines.
Cell lines are given in cluster order. Values are plotted relative to the mean. Doubling
times greater than the mean are shown in green, those with doubling time less than the
mean are shown in red. d, Three related gene clusters that were enriched for genes whose
expression level variation was correlated with cell line proliferation rate. Each of the
three gene clusters (clustered solely on the basis of their expression patterns) showed
enrichment for sets of genes involved in distinct functional categories (for example,
ribosomal genes versus genes involved in pre-RNA splicing). e, Gene cluster in which all
characterized and sequence-verified cDNAs encode genes known to be regulated by
interferons. f, Gene cluster enriched for genes that have been implicated in drug
metabolism (indicated by asterisks). A further property of the gene clustering evident
here and in Fig. 2 is the strong tendency for redundant representations of the same gene
to cluster immediately adjacent to one another, even within larger groups of genes with
very similar expression patterns. In addition to illustrating the reproducibility and
consistency of the measurements, and providing independent confirmation of many of our
measurements, this property also demonstrates that these, and probably all, genes have
nearly unique patterns of variation across the 60 cell lines. If this were not the case,
and multiple genes had identical patterns of variation, we would not expect to be able
to distinguish, by clustering on the basis of expression variation, duplicate copies of
individual genes from the other genes with identical expression patterns.

[Fig. 2 image: cell-line dendrogram labelled by tissue of origin (breast, prostate, non-small-cell lung, leukaemia, colon), doubling time, 6,831 genes, expression-ratio colour scale, and enlarged gene clusters including the interferon cluster and the drug metabolism cluster, plotted across cell lines.]

nature genetics • volume 24 • march 2000

© 2000 Nature America Inc. • http://genetics.nature.com

Paradoxically, two related cell lines (MDA-MB435 and MDA- 
N), which were derived from a single patient with breast cancer 
and have been conventionally regarded as breast cancer cell lines, 
shared expression of the genes associated with melanoma. MDA- 
MB435 was isolated from a pleural effusion in a patient with 
metastatic ductal adenocarcinoma of the breast 24,25. It remains
possible that the origin of the cell line was a breast cancer, and that 
its gene expression pattern is related to the neuroendocrine fea- 
tures of some breast cancers 26 . But our results suggest that this cell 
line may have originated from a melanoma, raising the possibility 
that the patient had a co-existing occult melanoma. 

The higher-level organization of the cell-line tree — in which 
groups span cell lines from different tissue types — also reflected 
shared biological properties of the tissues from which the cell 
lines were derived. The carcinoma-derived cell lines were divided
into major branches that separated those that expressed genes
characteristic of epithelial cells from those that expressed genes
more typical of stromal cells. A cluster of genes is shown (Fig. 3b)
that is most strongly expressed in cell lines derived from colon
carcinomas, six of seven ovarian-derived cell lines and the two
breast cancer lines positive for the oestrogen receptor. The named
genes in this cluster have been implicated in several aspects of
epithelial cell biology 27. The cluster was enriched for genes whose
products are known to localize to the basolateral membrane of
epithelial cells, including those encoding components of
adherens complexes (for example, desmoplakin (DSP),
periplakin (PPL) and plakoglobin (JUP)), an
epithelial-expressed cell-cell adhesion molecule (M4S1) and a
sodium/hydrogen ion exchanger 28-31 (SLC9A1). It also contained genes
that encode putative transcriptional regulators of epithelial
morphogenesis, a human homologue of a Drosophila melanogaster
epithelial-expressed tumour suppressor (LLGL1) and a homeobox
gene thought to control calcium-mediated adherence in
epithelial cells 32,33 (MSX2).

In contrast, a separate, major branch of the cell-line dendrogram
(Fig. 1a) included all glioblastoma-derived cell lines, all
renal-cell-carcinoma-derived cell lines and the remaining
carcinoma-derived lines. The characteristic set of genes expressed in
this cluster included many whose products are involved in stromal
cell functions (Fig. 3d). Indeed, the two cell lines originally
described as 'sarcoma-like' in appearance (Hs578T, breast
carcinosarcoma, and SF539, gliosarcoma) expressed most of these
genes 34,35. Although no single gene was uniformly characteristic
of this cluster, each cell line showed a distinctive pattern of 
expression of genes encoding proteins with roles in synthesis or 
modification of the extracellular matrix (for example, caldesmon 
(CALD1), cathepsins, thrombospondin (THBS), lysyl oxidase 
(LOX) and collagen subtypes). Although the ovarian and most 
non-small-cell-lung-derived carcinomas expressed genes charac- 
teristic of both epithelial cells and stromal cells, they probably 
clustered with the CNS and renal cell carcinomas in this analysis 
because genes characteristically expressed in stromal cells were 
more abundantly represented in this gene set. 

Physiological variation reflected 
in gene expression patterns 

A cluster diagram of 6,831 genes (Fig. 2) is useful for exploring 
clusters of genes whose variation in mRNA levels was not obvi- 
ously attributable to cell or tissue type. We identified some gene 
clusters that were enriched for genes involved in specific cellular
processes; the variation in their expression levels may reflect
corresponding differences in activity of these processes in the cell
lines. For example, a cluster of 1,159 genes (Fig. 2a) included 
many whose products are necessary for progression through the 
cell cycle (such as CCNA1, MCM106 and MAD2L1), RNA pro- 
cessing and translation machinery (such as RNA helicases, 
hnRNPs and translation elongation factors) and traditional 
pathologic markers used to identify proliferating cells (MKI67). 
Within this large cluster were smaller clusters enriched for genes 
with more specialized roles. One cluster was highly enriched for 
numerous ribosomal genes, whereas another was more enriched 
for genes encoding RNA-splicing factors. The variation in
expression of these ribosomal genes was significantly correlated 
with variation in the cell doubling time (correlation coefficient of 
0.54), supporting the notion that the genes in this cluster were 
regulated in relation to cell proliferation rate or growth rate in 
these cell lines. 
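
As a rough check on the word "significantly", the reported coefficient can be converted to a t statistic. The worked equation below is a sketch under two assumptions that are not stated in the text: that the correlation was computed over all 60 cell lines (n = 60), and that a standard t-test for a Pearson coefficient applies.

```latex
t = r\sqrt{\frac{n-2}{1-r^{2}}}
  = 0.54\sqrt{\frac{58}{1-0.2916}}
  \approx 0.54 \times 9.05
  \approx 4.9
```

With 58 degrees of freedom, a t value of about 4.9 corresponds to a two-sided P well below 0.001, which is consistent with the description of the correlation as significant.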

In a smaller gene cluster (Fig. 2d), all of the named genes were 
previously known to be regulated by interferons 13,36. Additional
groups of interferon-regulated genes showed distinct patterns of
expression (data not shown), suggesting that the NCI60 cell lines
exhibited variation in activity of interferon-response pathways,
which was reflected in gene expression patterns 36.

Another cluster (Fig. 2c) contained several genes encoding
proteins with possible interrelated roles in drug metabolism,
including glutamate-cysteine ligase (GLCLC, the enzyme responsible
for the rate limiting step of glutathione synthesis), thioredoxin
(TXN) and thioredoxin reductase (TXNRD1; enzymes
involved in regulating redox state in cells), and MRP1 (a drug
transporter known to efficiently transport glutathione-conjugated
compounds 37). The elevated expression of this set of genes
in a subset of these cell lines may reflect selection for resistance to 
chemotherapeutics. 

Cell lines facilitate interpretation of gene expression 
patterns in complex clinical samples 

Like many other types of cancer, tumours of the breast typically 
have a complex histological organization, with connective tissue 
and leukocytic infiltrates interwoven with tumour cells. To 
explore the possibility that variation in gene expression in the 
tumour cell lines might provide a framework for interpreting the 
expression patterns in tumour specimens, we compared RNA 
isolated from two breast cancer biopsy samples, a sample of nor- 
mal breast tissue and the NCI60 cell lines derived from breast 
cancers (excluding MDA-MB-435 and MDA-N) and leukaemias 
(Fig. 4). This clustering highlighted features of the gene expres- 
sion pattern shared between the cancer specimens and individual 
cell lines derived from breast cancers and leukaemias. 

The genes encoding keratin 8 (KRT8) and keratin 19 (KRT19),
as well as most of the other 'epithelial' genes defined in the
complete NCI60 cell line cluster, were expressed in both of the biopsy
samples and the two breast-derived cell lines, MCF-7 and T47D,
expressing the oestrogen receptor, suggesting that these
transcripts originated in tumour cells with features similar to those of
luminal epithelial cells (Fig. 5a). Expression of a set of genes
characteristic of stromal cells, including collagen genes (COL3A1,
COL5A1 and COL6A1) and smooth muscle cell markers
(TAGLN), was a feature shared by the tumour sample and the
stromal-like cell lines Hs578T and BT549 (Fig. 5b). This feature
of the expression pattern seen in the tumour samples is likely to
be due to the stromal component of the tumour. The tumours
also shared expression of a set of genes (Fig. 5c) with the multiple
myeloma cell line (RPMI-8226), notably including
immunoglobulin genes, consistent with the presence of B cells
in the tumour (this was confirmed by staining with
anti-immunoglobulin antibodies; data not shown). Therefore, distinct
sets of genes with co-varying expression among the samples
(Fig. 4, arrow) appear to represent distinct cell types that can be
distinguished in breast cancer tissue. A fourth cluster of genes,
more highly expressed in all of the cell lines than in any of the
clinical specimens, was enriched for genes present in the
'proliferation' cluster described above (Fig. 5d). The variation in
expression of these genes likely paralleled the difference in
proliferation rate between the rapidly cycling cultured cell lines and the
much more slowly dividing cells in tissues.

Fig. 3 Gene clusters related to tissue characteristics in the cell lines. Enlargements of
the regions of the cluster diagram in Fig. 1 showing gene clusters enriched for genes
expressed in cell lines of ostensibly similar origins. a, Cluster of genes highly
expressed in the leukaemia-derived cell lines. Two sub-clusters distinguish genes that
were expressed in most leukaemia-derived lines from those expressed exclusively in the
erythroblastoid line, K562 (note that the triplicate hybridizations cluster together).
b, Cluster of genes highly expressed in all colon (7/7) cell lines and all
breast-derived cell lines positive for the oestrogen receptor (2/2). This set of genes
was also moderately expressed in most ovarian lines (5/6) and some non-small-cell-lung
(4/6) lines, but was expressed at a lower level in all renal-cancer-derived lines.
c, Cluster of genes highly expressed in most melanoma-derived lines (6/7) and two
related lines ostensibly derived from breast cancer (MDA-MB435 and MDA-N). d, Cluster of
genes highly expressed in all glioblastoma (6/6) lines and most lines derived from
renal-cell carcinoma (7/8), and more moderately expressed in a subset of
carcinoma-derived lines. In all panels, names are shown only for all known genes whose
identities were independently re-verified by sequencing. The number of
sequence-validated ESTs within the cluster is indicated below the cluster in
parentheses. The position of gene names in the adjacent list only approximates their
position in the cluster diagram as indicated by the lines connecting the colour chart
with the gene list. Complete cluster images with all gene names and accession numbers
are available (http://genome-www.stanford.edu/nci60).

Discussion 

Newly available genomics tools allowed us to explore variation in 
gene expression on a genomic scale in 60 cell lines derived from 
diverse tumour tissues. We used a simple cluster analysis to iden- 
tify the prominent features in the gene expression patterns that 
appeared to reflect 'molecular signatures' of the tissue from
which the cells originated. The histological characteristics of the 
cell lines that dominated the clustering were pervasive enough 
that similar relationships were revealed when alternative subsets 
of genes were selected for analysis. Additional features of the 
expression pattern may be related to variation in physiological 
attributes such as proliferation rate and activity of interferon- 
response pathways. 

The properties of the tumour-derived cell lines in this study
have presumably all been shaped by selection for resistance to 
host defences and chemotherapeutics and for rapid proliferation 
in the tissue culture environment of synthetic growth media, fetal 
bovine serum and a polystyrene substratum. But the primary 
identifiable factor accounting for variation in gene expression 
patterns among these 60 cell lines was the identity of the tissue 
from which each cell line was ostensibly derived. For most of the 
cell lines we examined, neither physiological nor experimental 
adaptation for growth in culture was sufficient to overwrite the 
gene expression programs established during differentiation in 
vivo. Nevertheless, the prominence of mesenchymal features in 
the cell lines isolated from glioblastomas and carcinomas may 
reflect a selection for the relative ease of establishment of cell 
lines expressing stromal characteristics, perhaps combined with 
physiological adaptation to tissue culture conditions 38-40.



Fig. 4 Comparison of the gene expression patterns in clinical breast cancer
specimens and cultured breast cancer and leukaemia cell lines. a, Two-dimensional
hierarchical clustering applied to gene expression data for two breast
cancer specimens, a lymph node metastasis from one patient, normal breast
and the NCI60 breast and leukaemia-derived cell lines. The gene expression
data from tissue specimens was clustered along with expression data from a
subset of the NCI60 cell lines to explore whether features of expression
patterns observed in specific lines could be identified in the tissue samples. Labels
indicate gene clusters (shown in detail in Fig. 5) that may be related to specific
cellular components of the tumour specimens. b, Breast cancer specimen 16
stained with anti-keratin antibodies, showing the complex mix of cell types
characteristically found in breast tumours. The arrows highlight the different
cellular components of this tissue specimen that were distinguished by the
gene expression cluster analysis (Fig. 5).



Biological themes linking genes with related expression pat- 
terns may be inferred in many cases from the shared attributes of 
known genes within the clusters. Uncharacterized cDNAs are 
likely to encode proteins that have roles similar to those of the 
known gene products with which they appear to be co-regulated.
Still, for several clusters of genes, we were unable to discern a com- 
mon theme linking the identified members of the cluster. Further 
exploration of their variation in expression under more diverse 
conditions and more comprehensive investigation of the physiol- 
ogy of the NCI60 cells may provide insight 10 . The relationship of 
the gene expression patterns to the drug sensitivity patterns mea- 
sured by the DTP is an example of linking variation in gene 
expression with more subtle and diverse phenotypic variation 1 

The patterns of gene expression measured in the NCI60 cell
lines provide a framework that helps to distinguish the cells that 
express specific sets of genes in the histologically complex breast 
cancer specimens 41 . Although it is now feasible to analyse gene 
expression in micro-dissected tumour specimens 42,43 , this obser- 
vation suggests that it will be possible to explore and interpret 
some of the biology of clinical tumour samples by sampling them 
intact. As is useful in conventional morphological pathology, one 
might be able to observe interactions between a tumour and its 
microenvironment in this way. These relationships will be clari- 
fied by suitable analysis of gene expression patterns from intact as 
well as dissected tumours 12,14,15,41.

Methods 

cDNA clones. We obtained the 9,703 human cDNA clones (Research Genetics)
used in these experiments as bacterial colonies in 96-well microtitre
plates 9. Approximately 8,000 distinct Unigene clusters (representing
nominally unique genes) were represented in this set of clones. All genes
identified here by name represent clones whose identities were confirmed by
re-sequencing, or by the criterion that two or more independent cDNA clones
ostensibly representing the same gene had nearly identical gene expression
patterns. A single-pass 3' sequence re-verification was attempted for every
clone after re-streaking for single colonies. For a subset of genes for which
quality 3' sequence was not obtained, we attempted to confirm identities by
5' sequencing. Of the subset of clones selected for 5' sequence verification
on the basis of an interesting pattern of expression (888 total), 331 were
correctly identified, 57 were incorrectly identified and 500 were indeterminate
(poor quality sequence). We estimated that 15%-20% of array elements contained
DNA representing more than one clone per well. So far, the identities of
~3,000 clones have been verified. The full list of clones used and their
nominal identities are available (gene names preceded by the designation "SID#"
(Stanford Identification) represent clones whose identities have not yet been
verified; http://genome-www.stanford.edu:8000/nci60).
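
The 5' verification counts above can be checked with simple arithmetic. The sketch below is illustrative only (the function name is hypothetical): it confirms that the three categories sum to 888 and computes the fraction correct among clones that gave a determinate result. These clones were selected for their expression patterns rather than at random, so treating this fraction as an estimate for the whole array is an assumption.

```python
def verification_summary(correct, incorrect, indeterminate):
    """Summarize clone re-sequencing outcomes.

    The fraction correct is computed only over clones with a determinate
    result (correct + incorrect), since poor-quality reads say nothing
    about clone identity.
    """
    total = correct + incorrect + indeterminate
    determinate = correct + incorrect
    return {
        "total": total,
        "determinate": determinate,
        "fraction_correct_of_determinate": correct / determinate,
        "fraction_indeterminate": indeterminate / total,
    }


summary = verification_summary(correct=331, incorrect=57, indeterminate=500)
print(summary)
```

Among the 388 clones with a determinate result, about 85% were correctly identified, which is in line with the estimate of approximately 80% given in the Results.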

Production of cDNA microarrays. The arrays used in this experiment were
produced at Synteni Inc. (now Incyte Pharmaceuticals). Each insert was
amplified from a bacterial colony by sampling 1 µl of bacterial media and
performing PCR amplification of the insert using consensus primers for
the three plasmids represented in the clone set
(5'-TTGTAAAACGACGGCCACTG-3', 5'-CACACAGGAAACAGCTATG-3'). Each PCR product
(100 µl) was purified by gel exclusion, concentrated and resuspended in
3×SSC (10 µl). The PCR products were then printed on treated glass
microscope slides using a robot with four printing tips. Detailed protocols
for assembling and operating a microarray printer, and printing and
experimental application of DNA microarrays are available
(http://cmgm.stanford.edu/pbrown).

Preparation of mRNA and reference pool. Cell lines were grown from NCI 
DTP frozen stocks in RPMI-1640 supplemented with phenol red, glutamine 
(2 mM) and 5% fetal calf serum. To minimize the contribution of variations 
in culture conditions or cell density to differential gene expression, we grew 
each cell line to 80% confluence and isolated mRNA 24 h after transfer to 
fresh medium. The time between removal from the incubator and lysis of the 
cells in RNA stabilization buffer was minimized (< 1 min). Cells were lysed in 
buffer containing guanidium isothiocyanate and total RNA was purified 
with the RNeasy purification kit (Qiagen). We purified mRNA as needed
using a poly(A) purification kit (Oligotex, Qiagen) according to the
manufacturer's instructions. We assessed the integrity of the mRNA and its
relative contamination with ribosomal RNA by denaturing agarose gel
electrophoresis.

The breast tumours were surgically excised from patients and rapidly 
transported to the pathology laboratory, where samples for microarray 
analysis were quickly frozen in liquid nitrogen and stored at -80 °C until 
use. A frozen tumour specimen was removed from the freezer, cut into 
small pieces (~50-100 mg each), immediately placed into 10-12 ml of Trizol
reagent (Gibco-BRL) and homogenized using a PowerGen 125 Tissue
Homogenizer (Fisher Scientific), starting at 5,000 r.p.m. and gradually
increasing to ~20,000 r.p.m. over a period of 30-60 s. We processed the
Trizol/tumour homogenate as described in the Trizol protocol, including an
initial step to remove fat. Once total RNA was obtained, we isolated mRNA 
with a FastTrack 2.0 kit (Invitrogen) using the manufacturer's protocol for 
isolating mRNA starting from total RNA. The normal breast samples were 
obtained from Clontech. 

Fig. 5 Histologic features of breast cancer biopsies can be recognized and parsed based on gene expression patterns. Enlargements of the regions of the cluster
diagram in Fig. 4 showing gene clusters enriched for genes expressed in different cell types in the breast cancer specimens, as distinguished by clustering with the
cultured cell lines. a, A cluster including many genes characteristic of epithelial cells expressed in cell lines (T47D and MCF7) derived from breast cancer positive for
the oestrogen receptor and tumours. b, Genes expressed in cell lines derived from breast cancer with stromal cell characteristics (Hs578T and BT549) and tumour
specimens. Expression of these genes in the tumour samples may reflect the presence of myofibroblasts in the cancer specimen stroma. c, Genes expressed in
leukocyte-derived cell lines, showing common leukocyte, and separate 'myeloid' and 'B-cell', gene clusters. d, Genes that were relatively highly expressed in all cell lines
compared with the tumour specimens and normal breast. The higher expression of this set of genes involved in cell cycle transit in the cell lines is likely to reflect the
higher proliferative rate of cells cultured in the presence of serum compared with the average proliferation rate of cells in the biopsied tissue.

We combined mRNA from the following cells in equal quantities to
make the reference pool: HL-60 (acute myeloid leukaemia) and K562
(chronic myeloid leukaemia); NCI-H226 (non-small-cell lung); COLO
205 (colon); SNB-19 (central nervous system); LOX-IMVI (melanoma);
OVCAR-3 and OVCAR-4 (ovarian); CAKI-1 (renal); PC-3 (prostate); and
MCF7 and Hs578T (breast). The criteria for selection of the cell lines in
the reference are described in detail in the accompanying manuscript (ref. 12).

Doubling-time calculations. We calculated doubling times from routine
NCI60 cell line compound screening data; they reflect the doubling
times for cells inoculated into 96-well plates at the screening inoculation
densities and grown in RPMI 1640 medium supplemented with 5%
fetal bovine serum for 48 h. We measured cell populations using a
sulforhodamine B optical density assay. The doubling time constant k
was calculated using the equation $N/N_0 = e^{kt}$, where $N_0$ is the optical density
for control (untreated) cells at time zero, $N$ is the optical density for control cells
after 48-h incubation, and $t$ is 48 h. The same equation was then used with the
derived k to calculate the doubling time t by setting $N/N_0 = 2$. For a given cell
line, we obtained $N_0$ and $N$ values by averaging optical densities (n > 6,000)
obtained for each cell line for a year's screening. Data and experimental details
are available (http://dtp.nci.nih.gov).
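
The two-step calculation described above (solve the exponential growth equation for k, then set the ratio to 2) can be sketched as follows. This is a minimal illustration only: the function names and the optical density values are hypothetical and are not taken from the screening data.

```python
import math

def growth_constant(od_start, od_end, hours=48.0):
    """Solve N/N0 = exp(k*t) for k, given optical densities at 0 h and t h."""
    return math.log(od_end / od_start) / hours

def doubling_time(k):
    """Time t at which N/N0 = 2, i.e. t = ln(2)/k."""
    return math.log(2.0) / k

# Hypothetical averaged optical densities: a fourfold increase over 48 h
k = growth_constant(od_start=0.25, od_end=1.0)
print(round(doubling_time(k), 3))  # two doublings in 48 h
```

A fourfold increase over 48 h corresponds to two doublings, so the sketch should report a doubling time of 24 h.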

Preparation and hybridization of fluorescent labelled cDNA. For each
comparative array hybridization, labelled cDNA was synthesized by reverse
transcription from test cell mRNA in the presence of Cy5-dUTP, and from
the reference mRNA with Cy3-dUTP, using the Superscript II
reverse-transcription kit (Gibco-BRL). For each reverse transcription reaction, mRNA
(2 µg) was mixed with an anchored oligo-dT (d-20T-d(AGC)) primer (4
µg) in a total volume of 15 µl, heated to 70 °C for 10 min and cooled on ice.
To this sample, we added an unlabelled nucleotide pool (0.6 µl; 25 mM
each dATP, dCTP, dGTP, and 15 mM dTTP), either Cy3 or Cy5 conjugated
dUTP (3 µl; 1 mM; Amersham), 5× first-strand buffer (6 µl; 250 mM
Tris-HCl, pH 8.3, 375 mM KCl, 15 mM MgCl2), 0.1 M DTT (3 µl) and 2 µl of
Superscript II reverse transcriptase (200 U/µl). After a 2-h incubation at 42
°C, the RNA was degraded by adding 1 N NaOH (1.5 µl) and incubating at
70 °C for 10 min. The mixture was neutralized by adding 1 N HCl (1.5
µl), and the volume brought to 500 µl with TE (10 mM Tris, 1 mM EDTA).
We added Cot1 human DNA (20 µg; Gibco-BRL), and purified the probe
by centrifugation in a Centricon-30 micro-concentrator (Amicon). The
two separate probes were combined, brought to a volume of 500 µl, and
concentrated again to a volume of less than 7 µl. We added 10 µg/µl
poly(A) RNA (1 µl; Sigma) and tRNA (10 µg/µl; Gibco-BRL),
and adjusted the volume to 9.5 µl with distilled water. For final probe
preparation, 20×SSC (2.1 µl; 1.5 M NaCl, 150 mM NaCitrate, pH 8.0) and
10% SDS (0.35 µl) were added to a total final volume of 12 µl. The probes
were denatured by heating for 2 min at 100 °C, incubated at 37 °C for
20-30 min, and placed on the array under a 22 mm × 22 mm glass coverslip.
We incubated slides overnight at 65 °C for 14-18 h in a custom slide chamber
with humidity maintained by a small reservoir of 3×SSC. Arrays were
washed by submersion and agitation for 2-5 min in 2×SSC with 0.1% SDS,
followed by 1×SSC and then 0.1×SSC. The arrays were "spun dry" by
centrifugation for 2 min at 650 r.p.m. in a slide-rack in Microplus carriers
in a Beckman GS-6 tabletop centrifuge.

Array quantitation and data processing. Following hybridization, arrays
were scanned using a laser-scanning microscope (ref. 17;
http://cmgm.stanford.edu/pbrown). Separate images were acquired for Cy3 and Cy5. We
carried out data reduction with the program ScanAlyze (M.B.E., available
at http://rana.stanford.edu/software). Each spot was defined by manual
positioning of a grid of circles over the array image. For each fluorescent 
image, the average pixel intensity within each circle was determined, and a 
local background was computed for each spot equal to the median pixel 
intensity in a square of 40 pixels in width and height centred on the spot 
centre, excluding all pixels within any defined spots. Net signal was deter- 
mined by subtraction of this local background from the average intensity 
for each spot. Spots deemed unsuitable for accurate quantitation because 
of array artefacts were manually flagged and excluded from further analy- 
sis. Data files generated by ScanAlyze were entered into a custom database 
that maintains web-accessible files. Signal intensities between the two fluo- 
rescent images were normalized by applying a uniform scale factor to all 
intensities measured for the Cy5 channel. The normalization factor was 
chosen so that the mean log(Cy3/Cy5) for a subset of spots that achieved a 
minimum quality parameter (approximately 6,000 spots) was 0. This
effectively defined the signal-intensity-weighted 'average' spot on each array to
have a Cy3/Cy5 ratio of 1.0.
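
As an illustration of the normalization step, the sketch below computes a single scale factor for the Cy5 channel such that the mean log(Cy3/Cy5) over the quality-filtered spots becomes 0. It is not the ScanAlyze or database code: the function names, the boolean quality flag and the intensities are assumptions made for the example, and the inputs are taken to be positive, background-subtracted net signals.

```python
import math

def cy5_scale_factor(cy3, cy5, passes_quality):
    """Factor f such that mean(log(Cy3 / (f * Cy5))) == 0 over accepted spots."""
    logs = [math.log(a / b)
            for a, b, ok in zip(cy3, cy5, passes_quality) if ok]
    # log(f) equals the mean log ratio of the accepted spots
    return math.exp(sum(logs) / len(logs))

def normalize_cy5(cy3, cy5, passes_quality):
    """Apply the uniform factor to every Cy5 intensity, accepted or not."""
    f = cy5_scale_factor(cy3, cy5, passes_quality)
    return [b * f for b in cy5]

# Hypothetical net signals; the last spot fails the quality filter
cy3 = [200.0, 400.0, 800.0, 50.0]
cy5 = [100.0, 200.0, 400.0, 500.0]
ok = [True, True, True, False]
scaled = normalize_cy5(cy3, cy5, ok)
```

Because the factor is derived from log ratios, it is the geometric mean of the Cy3/Cy5 ratios of the accepted spots. Rejected spots are rescaled but do not influence the factor.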

Cluster analysis. We extracted tables (rows of genes, columns of individual 
microarray hybridizations) of normalized fluorescence ratios from the data- 
base. Various selection criteria, discussed in relation to each data set, were 
applied to select subsets of genes from the 9,703 cDNA elements on the 
arrays. Before clustering and display, the logarithms of the measured fluorescence
ratios for each gene were centred by subtracting the arithmetic mean of
all ratios measured for that gene. The centring makes all subsequent analyses 
independent of the amount of each gene's mRNA in the reference pool. 
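
The claim that centring removes the dependence on the reference pool can be checked with a small sketch. A change in the amount of a gene's mRNA in the reference multiplies every ratio for that gene by the same constant, which shifts every log ratio by the same amount and therefore cancels when the mean is subtracted. The base-2 logarithm and the function name are assumptions for the example; the text does not specify the base.

```python
import math

def centre_log_ratios(ratios):
    """Log-transform one gene's ratios and subtract their arithmetic mean."""
    logs = [math.log2(r) for r in ratios]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]

# Hypothetical ratios for one gene across three hybridizations
original = centre_log_ratios([1.0, 2.0, 4.0])

# Same gene if the reference pool held ten times less of its mRNA
rescaled = centre_log_ratios([10.0, 20.0, 40.0])
```

Both calls give the same centred profile, so only the pattern of variation across samples remains.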

We applied a hierarchical clustering algorithm separately to the cell lines 
and genes using the Pearson correlation coefficient as the measure of
similarity and average linkage clustering (refs 3, 19-21). The results of this process are
two dendrograms (trees), one for the cell lines and one for the genes, in 
which very similar elements are connected by short branches, and longer 
branches join elements with diminishing degrees of similarity. For visual 
display the rows and columns in the initial data table were reordered to 
conform to the structures of the dendrograms obtained from the cluster 
analysis. Each cell in the cluster-ordered data table was replaced by a graded 
colour (pure red through black to pure green), representing the mean- 
adjusted ratio value in the cell. Gene labels in cluster diagrams are dis- 
played here only for genes that were represented in the microarray by 
sequence-verified cDNAs. A complete software implementation of this 
process is available (http://rana.stanford.edu/software), as well as all
clustering results (http://genome-www.stanford.edu/nci60).
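
For readers unfamiliar with the procedure, the following sketch shows agglomerative clustering with the Pearson correlation coefficient as the similarity measure. It is not the authors' software. It assumes one common definition of average linkage (the mean of all pairwise similarities between members of two clusters), and the software cited above may use a different variant. The function names and the example profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_linkage(profiles):
    """Repeatedly merge the two most similar clusters.

    Returns a list of (members_a, members_b, similarity) tuples in merge
    order, which is enough information to draw a dendrogram.
    """
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [pearson(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)  # mean pairwise similarity
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical mean-centred profiles: 0 and 1 rise together, 2 falls
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [3.0, 2.0, 1.0]]
merges = average_linkage(profiles)
```

Early merges with high similarity correspond to the short branches of the dendrogram, and later merges with low similarity to the long ones. The quadratic search over cluster pairs is adequate for illustration but too slow for thousands of genes.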

Fink, Anthony L. Chaperone-Mediated Protein Folding. Physiol Rev. 79: 425-449, 1999. The folding of most newly
synthesized proteins in the cell requires the interaction of a variety of protein cofactors known as molecular 
chaperones. These molecules recognize and bind to nascent polypeptide chains and partially folded intermediates 
of proteins, preventing their aggregation and misfolding. There are several families of chaperones; those most 
involved in protein folding are the 40-kDa heat shock protein (HSP40; DnaJ), 60-kDa heat shock protein (HSP60; 
GroEL), and 70-kDa heat shock protein (HSP70; DnaK) families. The availability of high-resolution structures has 
facilitated a more detailed understanding of the complex chaperone machinery and mechanisms, including the 
ATP-dependent reaction cycles of the GroEL and HSP70 chaperones. For both of these chaperones, the binding of 
ATP triggers a critical conformational change leading to release of the bound substrate protein. Whereas the main 
role of the HSP70/HSP40 chaperone system is to minimize aggregation of newly synthesized proteins, the HSP60 
chaperones also facilitate the actual folding process by providing a secluded environment for individual folding 
molecules and may also promote the unfolding and refolding of misfolded intermediates. 

I. INTRODUCTION

The basic paradigm of molecular chaperones is that
they recognize and selectively bind nonnative, but not
native, proteins to form relatively stable complexes (48).
In most cases, the complexes are dissociated by the binding
and hydrolysis of ATP. In addition, there are "specific"
molecular chaperones that typically are involved in the
assembly of particular multiprotein complexes. Molecular
chaperones comprise several highly conserved families of
unrelated proteins; many chaperones are also heat shock
(stress) proteins. The ubiquitous role of molecular chaperones
continues to unfold with more discoveries each
year. In the context of in vivo protein folding, chaperones
prevent irreversible aggregation of nonnative conformations
and keep proteins on the productive folding pathway.
In addition, they may maintain newly synthesized
proteins in an unfolded conformation suitable for translocation
across membranes and bind to nonnative proteins
during cellular stress, among other functions. It is
likely that most, if not all, cellular proteins will interact
with a chaperone at some stage of their lifetime.

The focus of this review is on the functional contri- 
bution of chaperones to in vivo protein folding and as- 
sembly, especially those chaperones that are promiscu- 
ous, in that they show broad specificity for binding 
nonnative proteins. In addition to several specialized re- 
view articles on chaperones, e.g., References 11, 49, 50, 
53, 76, 77, 90, 144, 164, 190, 191, 226, 228, 252, there have 
been two recent monographs published on the subject 
(60, 150). This review is of necessity selective; the main 
goal is to furnish an up-to-date overview of the role of the 
major molecular chaperones involved in protein folding. 
In view of the vast array of literature on molecular chap- 
erones, and the ease of access to literature citations using 
the Internet, this article should not be viewed as exhaus- 
tive. 

Why do we need chaperones? After all, a basic tenet 
of in vitro protein folding has been the seminal work of 
Anfinsen (2), which demonstrated that formation of the 
native protein from the unfolded state is a spontaneous 
process determined by the global free energy minimum. 
The results indicated that the native state of small glob- 
ular proteins is determined by their amino acid sequence. 
However, the experimental conditions necessary to suc- 
cessfully fold many proteins, especially larger ones, in 
vitro, are very constrictive, usually requiring very low 
protein concentration and long incubation times and are 
usually unphysiological (e.g., relatively low tempera- 
tures). In contrast, most cells operate at ambient or ho- 
meothermically set temperatures (e.g., 37°C) where the 
hydrophobic effect will be stronger and thus protein de- 
naturation and aggregation will be bigger problems, and 
the time-frame available for successful folding is short. 
Thus there is the need for additional factors for the suc- 
cessful folding of many proteins in vivo. When one con- 
siders the crowded cellular environment within a cell, it 
becomes clear that in vitro folding experiments at low 
protein concentrations are poor models for what happens 
in the cell, where a newly synthesized protein is in an 
environment with little or no "free" water, very high con- 
centrations of other proteins and metabolites, and typi- 
cally membranes, cytoskeletal elements, and other cellular
components. Thus the need for chaperones 1) to
prevent aggregation and misfolding during the folding of 
newly synthesized chains, 2) to prevent nonproductive 
interactions with other cell components, 3) to direct the 
assembly of larger proteins and multiprotein complexes, 
and 4) during exposure to stresses that cause previously 
folded proteins to unfold, becomes evident. In the few 
cases where folding has been studied both in vivo and in 
vitro, it appears that the folding pathways are similar (148, 
190). 



Cells have solved the problem of misfolding and ag- 
gregation, to a considerable extent at least, through the 
participation of molecular chaperones in the in vivo fold- 
ing process. Many investigations in the past few years 
have confirmed the critical role of molecular chaperones 
in protein folding in the cell. Although much has been 
learned about the function of chaperones in protein fold- 
ing, and the general outline of the process is thought to be 
understood, there are still many important unresolved 
issues, and new chaperones and cochaperones are still 
being discovered. 

The molecular chaperones involved in the folding of 
newly synthesized proteins recognize nonnative substrate 
proteins predominantly via their exposed hydrophobic 
residues. The major chaperone classes are 40-kDa heat 
shock protein (HSP40; the DnaJ family), 60-kDa heat 
shock protein [HSP60; including GroEL and the T-com- 
plex polypeptide 1 (TCP-1) ring complexes], 70-kDa heat 
shock protein (HSP70), and 90-kDa heat shock protein 
(HSP90). All these chaperones can prevent the aggrega- 
tion of at least some unfolded proteins. For HSP60 and 
HSP70, their activity is modulated by the binding and 
hydrolysis of ATP. The HSP70 (DnaK in Escherichia coli) 
bind to nascent polypeptide chains on ribosomes, pre- 
venting their premature folding, misfolding, or aggrega- 
tion, as well as to newly synthesized proteins in the 
process of translocation from the cytosol into the mito- 
chondria and the endoplasmic reticulum (ER). The HSP70 
are regulated by HSP40 (DnaJ or its homologs). The 
HSP60 are large oligomeric ring-shaped proteins known 
as chaperonins that bind partially folded intermediates, 
preventing their aggregation, and facilitating their folding 
and assembly. This family is composed of GroEL-like 
proteins in eubacteria, mitochondria, and chloroplasts 
and the TCP-1 (CCT or TRiC) family in the eukaryotic 
cytosol and the archaea. The HSP60 (GroEL in E. coli) are 
large, usually tetradecameric proteins with a central cav- 
ity in which nonnative protein structures bind. The HSP60 
are found in all biological compartments except the ER. 
The HSP60 are regulated by a cochaperone, chaperonin 
10 (cpn10) (GroES in E. coli). In addition to preventing
aggregation, it has been suggested that HSP60 may permit 
misfolded structures to unfold and refold. The HSP90 are 
associated with a number of proteins and play important 
roles in modulating their activity, most notably the steroid 
receptors. A number of other proteins involved in the 
folding of many newly synthesized proteins are often 
considered to be molecular chaperones; these include 
protein disulfide isomerase and peptidyl prolyl isomerase, 
which catalyze the rearrangement of disulfide bonds and 
isomerization of peptide bonds around Pro residues, re- 
spectively, and are perhaps better considered to be fold- 
ing catalysts rather than chaperones. As mentioned pre- 
viously, there are also a number of more specific 
chaperones that are involved in the folding/assembly of 
only one, or a very limited number, of particular substrate
proteins. 

Chaperones are catalysts in the sense that they tran- 
siently interact with their substrate proteins but are not 
present in the final folded product, and also in that they 
increase the yield of folded protein. However, there is no 
good evidence that they actually enhance the spontane- 
ous rate of folding itself, although they may appear to do 
this by minimizing off-pathway reactions. 

A brief perusal of the literature demonstrates that our 
knowledge of molecular chaperones is growing at an 
enormous pace. To put these new discoveries in context, 
a few more general points are worthy of note. Although in 
many respects the field of molecular chaperones can now 
be considered a mature one, in that it has passed its first 
decade of life, and the broad outlines, at least, are rea- 
sonably well established, there are still many outstanding 
questions. Furthermore, there are many areas of consid- 
erable controversy, and many of these relate to funda- 
mental questions. For example, we do not yet know with 
certainty whether all newly synthesized proteins interact 
with chaperones, although it is likely that they do. We 
certainly do not know much about all the interactions 
between the various chaperones themselves, as well as 
with newly synthesized proteins or other chaperone tar- 
get proteins. As discussed in this review, there are signif- 
icant controversies concerning which chaperones inter- 
act first with nascent polypeptides, and even whether all 
nascent polypeptides interact with chaperones. The 
GroEL family of chaperones has been intensively studied, 
especially in the context of in vitro protein folding, yet it 
is not clear just how important a role this family (the 
cpn60 chaperonins and their TCP-1 eukaryotic homologs) 
plays in the folding of most proteins in the cell. We are only
now beginning to get a picture of the apparently ubiqui- 
tous role of the HSP90 family in many critical processes in 
the cell, especially those involving protein-protein inter- 
actions. Recently, several new "accessory" proteins have 
been discovered, which apparently act as "cochaperones."
Again, their significance to protein folding and
denaturation in the cell in general is unclear at the present 
time; they may be highly specialized or may turn out to be 
critical in a broad range of cellular processes involving 
chaperones. Although some of the chaperones clearly are 
important in preventing protein aggregation, there is as 
yet no good evidence that chaperones play a significant
role on the opposite side of this equation, namely, in
solubilizing protein aggregates, although it would seem 
likely that this may in fact be a function of some chaper- 
ones. Even at the level of the specific mechanisms of 
chaperone function, there are many controversial as- 
pects, and those in the field know there have been some 
quite rancorous discussions over competing mechanisms. 
Thus the molecular chaperone field is one in which there 
are still many outstanding questions, including some quite
fundamental ones. Consequently, chaperone scientists
are likely to remain busy for a long time to come. 

We begin with a brief review of the current under- 
standing of in vitro protein folding and the potential for 
aggregation and misfolding. 

II. IN VITRO PROTEIN FOLDING 

Despite the fact that in vitro folding may not exactly 
mimic folding in the cell, it is minimally a good model for 
in vivo protein folding and has the critical advantage that 
a very wide variety of biophysical methods may be ap- 
plied to provide a detailed knowledge of the folding path- 
way, kinetics, and energetics. Significant increases in our 
understanding of the folding process have occurred in the 
past few years, especially through the application of so- 
phisticated new techniques, and these have been summa- 
rized in recent reviews (28, 29, 43, 44, 51, 56, 57, 177, 186, 
255, 267). Both in vivo and in vitro, proteins fold remark- 
ably rapidly, indicating that the folding pathway is di- 
rected in some way. Many studies have revealed interme- 
diates during in vitro protein folding experiments; it is not 
clear, and is very difficult to establish experimentally, 
whether these are on- or off-pathway species. Although it 
is becoming apparent that in some cases these may be 
off-pathway species (208), some appear to be true inter- 
mediates on the productive folding pathway, consistent 
with rugged energy landscapes (248). 

Small proteins may, under appropriate conditions, 
fold to the native state within a few tens of milliseconds 
with no detectable intermediates (177, 210). Such folding 
is consistent with smooth funnel energy landscape mod- 
els (248), i.e., no intermediates, but could also reflect very 
fast folding with intermediates of sufficiently short life- 
times that they are not detected by current methods (44). 
However, for many systems there is substantial experi- 
mental data to support the presence of partially folded 
intermediates during folding. Although stopped-flow cir- 
cular dichroism kinetics investigations reveal substantial 
secondary structure formation within a few milliseconds 
of the initiation of folding, most proteins take much 
longer to achieve the native state (seconds or longer). 

The earliest stages of folding involve hydrophobic 
collapse to a relatively compact state and formation of 
metastable secondary structure. It is not clear if collapse 
or secondary structure occur simultaneously or if one 
precedes the other. It is most likely that both proceed 
concurrently. Certainly secondary structural units may be 
formed on a microsecond time scale (24). There is no 
conclusive data yet available on how fast the collapse 
occurs. 

The nature of this initial collapsed state will vary 
depending on the conditions and the particular protein, 
but in general, it will consist of a very large number of 
substates. Further condensation will lead to one or more
particularly stable intermediates; again depending on the 
particular protein, the intermediate(s) will have regions of 
unique structure, especially in terms of the compactness, 
amount of secondary structure, and topology. It is very 
likely that in most cases these intermediates will consist 
of a core of nativelike structure with the remainder of the 
protein in varying degrees of disorder. Regions of the 
nonordered chain are probably flickering in and out of 
their nativelike secondary structure conformation. At 
least some proteins fold via a hierarchical path in which 
additional structural units coalesce to an initially formed 
core with nativelike structure (61). 

It is now clear, based on investigations of transient 
and equilibrium intermediates in vitro, that partially 
folded intermediates, as found with newly synthesized 
proteins in the cell, are particularly prone to aggregate, 
probably via specific intermolecular interactions between 
hydrophobic surfaces of structural subunits (59, 255). The 
intermediates are more prone to aggregate than the un- 
folded state because in the latter the hydrophobic side 
chains are scattered relatively randomly in many small 
hydrophobic regions, whereas in the partially folded in- 
termediates, there will be large patches of contiguous 
surface hydrophobicity that will have a much stronger 
propensity for aggregation. The tendency of partially 
folded intermediates to associate or aggregate is exacer- 
bated as the protein concentration increases. The growing 
recognition of the critical importance of protein aggrega- 
tion has resulted in a number of reviews (42, 59, 112, 
253-255). 

A. Molecular Chaperones and Protein Aggregation 

Both in vivo and in vitro the transition of a protein 
from the unfolded to folded state frequently results in the 
formation of partially folded intermediate states that have 
a very strong propensity to aggregate. In vivo this may 
lead to formation of inclusion bodies, especially when 
overexpression occurs. Members of the HSP60 and HSP70 
molecular chaperone families seem to be most directly, 
and most generally, involved in preventing this. Current 
understanding of the role of HSP70 in protein folding 
suggests that the chaperone sequesters the unfolded or 
partially folded protein, thereby preventing its aggrega- 
tion, but does not actively participate in the folding pro- 
cess; subsequent binding of ATP leads to release of the 
substrate protein in a nonnative conformation (144, 146, 
166, 167). The E. coli HSP60 chaperone GroEL and its 
eukaryotic homologs facilitate protein folding by binding 
partially folded intermediates (or partially folded domains 
of large multidomain proteins) in their large central cavity 
(see sect. VE). Folding can thus occur in a situation where
aggregation is precluded (144, 229). The general outline is 
summarized in Figure 1. 

FIG. 1. General outline of chaperone-mediated protein folding in a
cell. U represents nascent polypeptide or newly synthesized protein.
Chaperones include 40-, 60-, and 70-kDa heat shock proteins, protein
disulfide isomerase, and peptidyl prolyl isomerase or trigger factor. For
multisubunit proteins, the situation is more complex and not yet well
understood.



It is likely that a significant factor in the formation of 
in vivo aggregates, such as inclusion bodies, is a lack of 
available molecular chaperones, usually due to either the 
rapid rate of protein synthesis, the formation of long-lived 
folding intermediates, or a combination of both. Either 
situation could lead to saturation of the available chaper- 
ones. The longer a protein takes to fold spontaneously, 
the longer it is likely to remain associated with the HSP60 
and HSP70 chaperones. Some proteins fold fast and may 
have partially folded intermediates that have little propen- 
sity to aggregate, thus requiring little or no chaperone 
assistance and little tendency to form inclusion bodies. 

Several experiments have been conducted in which 
overexpression of various combinations of the DnaK and 
GroEL chaperone systems decreases the amount of ag- 
gregation (8, 72, 81, 134, 229). For example, newly syn- 
thesized proteins in E. coli were shown to aggregate 
extensively when the rpoH mutation was present (81). 
This mutation in the RNA polymerase σ32-subunit, which
is responsible for heat shock promoter recognition, leads
to a lack of heat shock proteins. Although growth is 
normal at 30°C, on elevating the temperature to 42°C, the 
cell is unable to produce sufficient chaperones and mas- 
sive aggregation is observed. Overproduction of either 
GroEL and GroES, or DnaK and DnaJ, significantly de- 
creases the aggregation at 42°C. If overexpressed to- 
gether, the four chaperones are able to suppress most of 
the aggregation. The data suggest that the GroEL/GroES 
and the DnaK/DnaJ chaperone systems have complemen- 
tary functions in the folding and assembly of most pro- 
teins. In addition, for in vitro aggregating systems, the 
presence of various chaperones increases the yield of 
soluble or native protein (16, 18, 97, 221). There have been 
conflicting reports as to whether the DnaK or GroEL 
systems, individually or together, yield the optimal 
amount of renaturation. It appears that in some cases all 
the chaperones are required for maximal suppression of 
aggregation, whereas in others either the DnaK system 
alone, or the GroEL system alone, was effective (229). It 
is possible that the two systems interact at different
stages of folding, and thus different results may be ob- 
served depending on the particular system (16). 

III. MOLECULAR CHAPERONES INVOLVED IN 
IN VIVO PROTEIN FOLDING 

The major classes of general chaperones are the 
HSP40, HSP60, HSP70, HSP90, 100-kDa heat shock pro- 
tein (HSP100), and the small heat shock proteins. Recent 
investigations have shown that not only do the major 
classes of chaperones often function with protein cofac- 
tors, but direct interactions between members of the 
HSP40, HSP70, and HSP90 families may be frequent. This
section provides a brief description of the main families of 
molecular chaperones involved in protein folding in the 
cell. 

A. Small Heat Shock Proteins and α-Crystallins

The small heat shock protein (HSP) and α-crystallin
family consists of 12- to 43-kDa proteins that assemble
into large multimeric structures and contain a conserved
COOH-terminal region termed the α-crystallin domain.
Many of the small HSP are produced only under stress 
conditions. They have been shown to function in vitro as 
chaperones by preventing protein aggregation in an ATP- 
independent manner. Several recent reviews have been 
published (19, 47, 113, 197). The role of small HSP in 
protein folding in vivo is unclear, but it seems unlikely 
that they are major players; this probably reflects the fact 
that release of bound, denatured proteins from the small 
heat shock proteins is very slow or nonexistent. For the 
α-crystallins in the eye lens, a major role is to bind denatured
proteins and prevent their aggregation (which
would result in cataracts). The small HSP bind denatured 
proteins tightly, but there is little evidence at present that 
they normally release the bound material subsequently. It 
has been proposed that their major function may be in 
times of stress when they bind denatured proteins and 
prevent their aggregation. Subsequently, when the stress 
is removed, these complexes may provide a reservoir for 
the HSP70 chaperone machinery to renature the bound 
proteins (47). The small HSP exhibit high affinity for 
partially folded intermediates but show no apparent sub- 
strate specificity and are only functional in the oligomeric 
form (131). Little is known about the mechanism of action 
of the small HSP; it has been suggested that the substrate 
protein coats the outside of the large chaperone multimer 
(133) and that hydrophobic interactions are critical in 
substrate binding. Several models have been proposed for 
the quaternary structure of the small HSP, but no consen- 
sus exists. A model in which small HSP prevent protein 
aggregation and may facilitate substrate refolding in
conjunction with other molecular chaperones has recently
been proposed (133). 

B. HSP40 Family 

The HSP40 or DnaJ family consists of over 100 mem- 
bers, defined by the presence of a highly conserved J 
domain of ~78 residues (DnaJ from E. coli has 376 amino
acids) (131). Proteins in this family typically consist of 
several domains, e.g., DnaJ contains at least four con- 
served regions representing potential functional domains 
(the J domain, which is linked by a Gly/Phe-rich region to 
a domain of unknown function, followed by a zinc-finger 
region, and ending with the COOH-terminal domain, also 
of unknown function). Much variability is seen in the 
non-J domains of members of this family. The best studied
examples are DnaJ from E. coli and several homologs
from yeast, such as Mdj1 and Ydj1 (33, 34, 189). The best
defined role thus far for the HSP40 is as a cochaperone for 
HSP70; however, even this function is not well under- 
stood, and there is evidence to indicate that DnaJ and 
other members of the HSP40 family are chaperones in 
their own right, binding to at least some unfolded proteins 
and nascent chains (94). The details of the putative role of 
DnaJ in protein folding are described in sections IV and V.
In E. coli, DnaK, DnaJ, and GrpE cooperate synergisti- 
cally in a variety of biological functions, including protein 
folding. The properties of DnaJ and its homologs have 
been reviewed previously (23, 34, 131, 260). 

Little is known about the structural features of DnaJ 
that are involved in its interaction with DnaK and un- 
folded proteins. Analysis of DnaJ fragments showed that 
both the NH2-terminal J domain and the adjacent glycine/
phenylalanine-rich region are required for interactions 
with DnaK (117) and to stimulate the ATPase activity of 
DnaK (220). The G/F motif of DnaJ is also involved in 
modulating the substrate binding activity of DnaK (246). 
However, only complete DnaJ is functional with DnaK 
and GrpE in refolding denatured firefly luciferase. Binding
experiments and cross-linking studies indicate that the 
zinc fingerlike domain is required for DnaJ to bind to 
nonnative proteins (220). 

Nuclear magnetic resonance spectroscopy has been 
used to determine the three-dimensional structure of the 
J domain in DnaJ from E. coli and humans (101, 181, 222). 
The structure is dominated by two long helices, with a 
hydrophobic core of highly conserved side chains. The 
residues believed responsible for the specificity of the 
interaction between DnaJ and its homologs with their 
corresponding HSP70 partners comprise a conserved His- 
Pro-Asp sequence that extends out from the core of the 
structure (171, 181). A peptide containing this sequence 
inhibited the Ydj1 stimulation of HSP70 ATPase activity
but did not prevent binding of nonnative substrate proteins,
indicating that DnaJ interacts with HSP70 at a site
distinct from the peptide binding site (238). The adjacent
Gly/Phe-rich domain in DnaJ is disordered and flexible in 
solution (222). 

The major effect of DnaJ on the functional cycle of 
DnaK is the significant stimulation of the ATPase rate-limiting
step, γ-phosphate cleavage, leading to stabilization
of DnaK-ADP-substrate protein complexes (146).
Both prokaryotic and eukaryotic forms of HSP40 interact 
with HSP70 in the presence of ATP to suppress protein 
aggregation (32). It has been proposed that HSP40 is 
required for the efficient binding of substrate protein to 
HSP70 through the stimulation of its ATPase activity (see 
sect. VA) (147).

It has been suggested that DnaJ acts directly as a molecular
chaperone in that it binds to certain denatured substrate
proteins such as firefly luciferase (130, 220, 221),
and even some specific folded proteins such as the σ32
heat shock transcription factor or the λP DNA replication
protein, but not "normal" native proteins (41, 262). However,
DnaJ binds to σ32 at a different site than that to
which DnaK binds. The yeast DnaJ homolog Ydj1 was
found to bind to denatured rhodanese but not unfolded
reduced carboxymethylated α-lactalbumin (32, 35). As
discussed in section IV, DnaJ or HSP40 has been proposed
to bind to nascent polypeptides to prevent their prema- 
ture folding and to target HSP70 to them (70). However, 
unambiguous data to support this role are scant. In yeast, 
DnaJ and its homologs are required not only for protein 
folding but also for selective ubiquitin-dependent degra- 
dation of abnormally folded proteins (132). 

Significant specificity in the interactions between 
members of the HSP70, DnaJ, and GrpE families has been 
observed (40). As noted, the interaction between a given 
HSP70 and its interacting DnaJ is determined by the J
domain (198). Recently, evidence for interactions between
DnaJ homologs and HSP90 has been reported
(118). It has also been suggested that DnaJ possesses an 
active dithiol/disulfide group and may catalyze protein 
disulfide formation, reduction, and isomerization (38). 

C. HSP60 Family 

Under the rubric of the HSP60 or chaperonin family, 
we consider both the GroEL and TCP-1 ring complex 
families. Unfortunately, different research groups have 
used different names for the TCP-1 ring complex, e.g., 
TRiC (for TCP-1 ring complex) and CCT (for chaperonin 
containing TCP-1). Other members include the Rubisco 
subunit binding protein and thermophilic factor 55 from 
archaea. GroEL and its homologs are found in pro- 
karyotes, chloroplasts, and mitochondria, whereas TCP-1 
and its homologs are found in the eukaryotic cytosol. 
Many of the HSP60 chaperones are also known as chaperonins
(cpn60) and are ring-shaped oligomeric protein
complexes with a large central cavity in which nonnative
proteins can bind. In bacteria, at least, HSP60 require a
cochaperonin, GroES (cpn10), for full function. The term
chaperonin was originally coined by Ellis (48) to refer to 
non-heat-induced HSP60. 

GroEL is probably the most studied of all molecular 
chaperones; in combination with its cochaperonin GroES 
and ATP, it facilitates protein folding, not only by prevent- 
ing aggregation but also by simultaneously allowing par- 
tially folded intermediates to fold in an environment con- 
ducive to stabilizing the native state. It has been 
suggested that GroEL may also function by unfolding 
misfolded states so as to allow their productive refolding 
(268, 269). Members of the HSP60 family are also involved 
in the assembly of large multiprotein complexes such as 
Rubisco (27, 243). The availability of a high-resolution
crystallographic structure, in conjunction with mutagen- 
esis studies, has helped in the elucidation of the details of 
the reaction cycle (see sect. VB). However, there are still
many points of controversy, reflecting the complexity of 
the mechanism of this large chaperone. Recent reviews 
include References 53, 105, 108, 144. 

The structure of the E. coli chaperonin GroEL has 
been solved by X-ray crystallography (9, 12, 263) and 
electron microscopy (196) and consists of 14 identical 
subunits in two stacked heptameric rings, each contain- 
ing a central cavity. Substantial structural information 
about GroEL, GroES, and related chaperonins is available
from the chaperonin web home page:
http://bioc09.uthscsa.edu/~seale/Chap/struc.html. Each subunit
consists of three domains: the equatorial, the
intermediate, and the apical. The latter, forming the 
mouth of the central cavity, undergoes major confor- 
mational changes on binding of ATP and the cochap- 
eronin GroES, which lead to substantial changes in the 
hydrophobic nature of the cavity (263) (Fig. 2). In 
particular, the relatively hydrophobic cavity lining to 
which the unfolded substrate protein binds before 
GroES binding becomes much more polar, coincident 
with a substantial increase in the size of the cavity. The 
hydrophobic polypeptide-binding site on the cavity-lin- 
ing surface of the apical domain was identified with the 
help of various mutants (54). These same residues are 
also essential for binding of the cochaperonin GroES, 
which is required for productive polypeptide release. 

The identity of amino acid residues at the nucleotide- 
binding sites of GroEL/GroES was determined by pho- 
toaffinity labeling with 2-azido-ATP (13). The labeled site 
is located at the GroEL/GroEL subunit interface, and 
labeling of the cochaperonin GroES occurred through a 
conserved proline. The 2.4-Å crystal structure of the bacterial
chaperonin GroEL complexed with adenosine 5'-O-(3-thiotriphosphate)
bound to each subunit shows that
ATP binds in a pocket with a unique nucleotide-binding
motif, whose primary sequence is highly conserved
among chaperonins (9).

Fig. 2. Structure of GroEL/GroES based on electron diffraction. A
cross section through a stacked GroES/GroEL/GroEL unit is shown.
Main features starting at top are GroES; enlarged cavity; boundary
between stacked GroEL units, flanked by equatorial domains; and
smaller cavity, with apical domain pointing inward. Major change in
cavity size upon binding GroES is clearly visible. [From Chen et al. (26).]

The 45-Å-diameter cavity in GroEL is large enough to
accommodate proteins of 40-50 kDa and, as noted, is 
larger when capped by the GroES heptamer. Conforma- 
tional changes are observed on binding nucleotides or 
GroES. The hydrophobic nature of the central cavity in 
GroEL (in the absence of the cochaperonin) presumably 
accounts for the lack of affinity for native proteins. GroES 
enhances the cooperativity of ATP binding and hydrolysis 
by GroEL and is necessary for the release and folding of 
many GroEL substrates. The crystallographic structure of 
GroES is known to high resolution (110). GroES has a 
highly mobile and accessible polypeptide loop whose mo- 
bility and accessibility are lost upon formation of the 
GroES/GroEL complex (128, 129). 

The TCP-1 is a hetero-oligomeric 970-kDa complex
containing several structurally related subunits of 52-65 
kDa found in the eukaryotic cytosol. These are assembled 
into a ring complex that resembles the GroEL double ring 
(69, 121, 187). In vitro, the TCP-1 ring complex appears to 
function independently of a small cochaperonin protein 
such as GroES. Thus far, TCP-1 complexes have been 
shown to be involved in the folding of very few proteins in 
the eukaryotic cytosol. 
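
As a rough consistency check on the figures quoted above, the number of subunit copies implied by a 970-kDa complex built from 52- to 65-kDa subunits can be estimated by simple division. The short Python sketch below assumes that the complex contains nothing but these subunits; the function and variable names are illustrative only.

```python
# Back-of-envelope estimate; assumes the complex consists only of its subunits.
COMPLEX_MASS_KDA = 970.0
SUBUNIT_MASS_RANGE_KDA = (52.0, 65.0)  # lightest and heaviest subunit


def subunit_count_bounds(complex_mass, subunit_range):
    """Return (fewest, most) subunit copies compatible with the masses."""
    lightest, heaviest = subunit_range
    # Heavier subunits imply fewer copies; lighter subunits imply more.
    return complex_mass / heaviest, complex_mass / lightest


low, high = subunit_count_bounds(COMPLEX_MASS_KDA, SUBUNIT_MASS_RANGE_KDA)
print(f"{low:.1f} to {high:.1f} subunits")  # prints "14.9 to 18.7 subunits"
```

The estimate, roughly 15 to 19 copies, is of the same order as the 14 subunits of the GroEL double ring.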

The major difference between TCP-1 complexes and 
GroEL is the hetero-oligomeric nature of the TCP-1 ring
complex; at least eight subunit species that are encoded 
by unique genes are known (216). The genes are calcu- 
lated to have diverged around the starting point of the 
eukaryotic lineage and share ~30% amino acid identity. It
has been proposed that this complexity may have evolved 
to cope with the folding and assembly of complex proteins
in eukaryotic cells (122). Although the TCP-1 subunits
are highly diverged from one another, individually they
are quite homologous, suggesting that each subunit has a
specific, independent function (30a, 121).

D. HSP70 Family 

The HSP70 are a family of molecular chaperones that 
are involved in protein folding and several other cellular 
functions and that exhibit weak ATPase activity. The 
HSP70 chaperones are composed of two major functional 
domains. The NH2-terminal, highly conserved ATPase domain
binds ADP and ATP very tightly (in the presence of
Mg2+ and K+) and hydrolyzes ATP, whereas the COOH-terminal
domain is required for polypeptide binding. Cooperation
of both domains is needed for protein folding.
Several recent reviews summarize the role of HSP70 mo- 
lecular chaperones in protein folding (58, 75, 83, 90, 100, 
144). Many of the functions of the E. coli HSP70, DnaK, 
require two cofactors, DnaJ (see sect. IIIB) and GrpE (see
sect. IIIJ). The majority of in vitro studies on HSP70 have
been with DnaK. 

The HSP70 family is very large, with most organisms 
having multiple members; most eukaryotes have at least a 
dozen or more different HSP70, found in a variety of 
cellular compartments. Some of the better known mam- 
malian members are HSC70 (or HSP73), the constitutive 
cytosolic member; HSP70 (or HSP72), the stress-induced 
cytosolic form; BiP (or Grp78), the ER form; and mHSP70 
(or mito-HSP70, or Grp75), the mitochondrial form. In 
yeast the homologs of HSC70 and BiP are known as 
Ssa1-4 and Kar2. In E. coli, the major form of HSP70 is
DnaK. Here we will use the term HSP70 to refer to any 
member of the family. 

The crystallographic structures of the bovine HSC70 
ATPase domain, the DnaK peptide-binding domain com- 
plexed with a peptide substrate, and most recently the 
human HSP70 ATPase domain have been determined (63, 
214, 272). The ATPase domain, which is structurally sim- 
ilar to actin and hexokinase, consists of four smaller 
domains forming two lobes with a deep cleft within which 
the MgATP and MgADP bind. The structure of the peptide-binding
domain consists of a β-sandwich subdomain
followed by α-helical segments. The peptide is bound to
DnaK in an extended conformation through a channel
defined by loops from the β-sandwich. An α-helical domain
(the flap or latch) is believed to stabilize the complex
but does not contact the peptide directly. Only five
residues of the substrate protein make significant con- 
tacts with HSP70, explaining the previously observed 
specificity for short, hydrophobic peptides, with a strong 
preference for hydrophobic residues such as Leu in the 
central region and a strong unfavorable interaction with 
negatively charged residues (7, 64, 80, 225). A model in 
which the flap over the substrate binding pocket could be
in either an open conformation (to allow entry and egress
of substrates) or closed conformation (to form a stable 
complex) has been suggested to account for the high- and 
low-affinity states of HSP70 (272). 

Recently, the substrate specificity of DnaK has been 
mapped out in detail by screening an immobilized peptide 
library (192), following up on earlier peptide-scanning 
experiments (7). DnaK binding sites in protein sequences 
occurred statistically every 36 residues. In the folded 
proteins, these sites are mostly buried, and the majority 
are found in β-sheets. The binding motif consists of a
hydrophobic core of four to five residues enriched particularly
in Leu, but also in Ile, Val, Phe, and Tyr, and two
flanking regions enriched in basic residues. Acidic resi- 
dues are excluded from the core and disfavored in flank- 
ing regions. On the basis of these data, an algorithm was 
established that predicts DnaK binding sites in protein 
sequences with high accuracy (192). 
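
To make the shape of such a motif search concrete, the following Python sketch scans a sequence with a sliding window and scores each candidate site using the qualitative rules stated above: a core enriched in Leu, Ile, Val, Phe, and Tyr; basic flanks; and acidic residues excluded from the core. It is not the algorithm of Ref. 192. The weights, window lengths, and function names are assumptions chosen purely for illustration.

```python
# Illustrative heuristic only: every weight here is invented for this sketch
# and is NOT the published scoring scheme of Ref. 192.
CORE_FAVORED = {"L": 2.0, "I": 1.0, "V": 1.0, "F": 1.0, "Y": 1.0}
BASIC = set("KR")
ACIDIC = set("DE")


def score_window(seq, start, core_len=5, flank_len=4):
    """Score one candidate site: a core plus its left and right flanks."""
    core = seq[start:start + core_len]
    left = seq[max(0, start - flank_len):start]
    right = seq[start + core_len:start + core_len + flank_len]
    if any(res in ACIDIC for res in core):
        return float("-inf")  # acidic residues are excluded from the core
    score = sum(CORE_FAVORED.get(res, 0.0) for res in core)
    for res in left + right:
        if res in BASIC:
            score += 0.5  # basic residues are enriched in the flanks
        elif res in ACIDIC:
            score -= 0.5  # acidic residues are disfavored in the flanks
    return score


def best_site(seq, core_len=5, flank_len=4):
    """Return (score, core_start) for the highest-scoring window."""
    seq = seq.upper()
    if len(seq) < core_len:
        raise ValueError("sequence is shorter than the core length")
    candidates = [
        (score_window(seq, i, core_len, flank_len), i)
        for i in range(len(seq) - core_len + 1)
    ]
    return max(candidates)
```

For the hypothetical peptide `KRKLLVLFKRK`, `best_site` returns `(11.0, 3)`, placing the core on `LLVLF` with basic residues on both sides.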

The HSP70 preferentially bind unfolded or partially 
folded proteins and do not bind normal native proteins 
(although there are a few specific interactions with proteins
in their native states, such as clathrin and σ32). It is
likely that only some newly synthesized proteins require 
the assistance of chaperones. In coimmunoprecipitation 
studies with anti-HSP70 antibodies and pulse-chase label- 
ing, it was observed that smaller proteins were dispropor- 
tionately absent, suggesting that they may fold more rap- 
idly, either with or without the assistance of HSP70 (5). 

In fact, there is some evidence to support the notion 
that HSP70 may not interact with short-lived partially 
folded intermediates (218). The HSP70 inhibits the refold- 
ing of the mitochondrial isozyme of aspartate aminotrans- 
ferase (AAT), but not the cytosolic homolog. This has 
been attributed to HSP70 binding to a long-lived early 
folding intermediate in the folding of mitochondrial AAT, 
for which the analogous cytosolic isozyme intermediate is 
shorter lived and rapidly transforms to a more nativelike 
species that does not bind to HSP70 (3). Because there 
will always be a kinetic competition between spontane- 
ous folding and chaperone binding, intermediates with 
shorter lifetimes than that required for binding to HSP70 
would not form a complex with the chaperone (Fig. 3). 
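
The kinetic competition described above can be illustrated with a minimal partitioning model. The Python sketch below assumes that a folding intermediate leaves by one of two parallel, irreversible first-order routes: spontaneous folding, or binding to chaperone present in excess. The rate constants in the usage note are hypothetical and are not taken from the studies cited.

```python
def fraction_captured(k_fold, k_on, chaperone_conc):
    """Fraction of an intermediate that binds chaperone before it folds.

    Assumes two parallel, irreversible exits from the intermediate:
    folding (first-order rate constant k_fold, in 1/s) and chaperone
    binding (pseudo-first-order rate k_on * chaperone_conc, in 1/s).
    """
    k_bind = k_on * chaperone_conc
    if k_fold < 0 or k_bind < 0 or (k_fold + k_bind) == 0:
        raise ValueError("rates must be non-negative and not both zero")
    return k_bind / (k_bind + k_fold)
```

With an assumed binding rate of 1 per second (for example, k_on = 1e5 per molar per second and 10 micromolar chaperone), an intermediate that folds at 100 per second is captured about 1% of the time, whereas one that folds at 0.01 per second is captured about 99% of the time.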

The rapid binding kinetics for substrate proteins to 
DnaK-ATP (199) suggest that ATP-bound DnaK is the 
primary form initiating interaction with substrates for 
chaperone activity. The resulting DnaK-ATP-substrate 
complexes, however, are also characterized by rapid dis- 
sociation of bound substrate but can be stabilized by 
hydrolysis of the ATP (stimulated to a small extent by the 
substrate itself, or to a large extent by DnaJ; Ref. 146). 
The ATP-induced protein-HSP70 complex dissociation re- 
sults from a conformational change induced in HSP70 by 
ATP binding. This conformational change decreases the 
affinity of HSP70 for nonnative substrate proteins and
leads to their dissociation (166). Because the binding of
ATP occurs in the NH2-terminal domain and peptide binding
is in the COOH-terminal domain, it is clear that strong
coupling between the two functional domains must exist.

Fig. 3. A newly synthesized protein, whether still associated with
ribosome or released, faces several competing pathways. It is likely that
in the absence of chaperones, aggregation or other forms of misfolding
would be the major pathway for many proteins.

Under appropriate conditions, DnaK undergoes auto- 
phosphorylation (137). It is not yet clear if this is of 
physiological significance. At the moment, there is no 
good evidence that it is. However, GrpE and synthetic 
peptides have been observed to inhibit the phosphoryla- 
tion (the effectiveness of a given peptide correlated with 
its affinity for DnaK), whereas DnaJ had no effect on the 
reaction (169). Human HSP70 is phosphorylated in vitro 
in the presence of divalent ions, with calcium being the 
most effective. Two calcium ions were found in the hu- 
man ATPase domain structure, and calcium binding may 
facilitate phosphorylation (214). 

Various techniques have shown that HSP70 adopts at 
least three significantly different conformations, one in 
the absence of nucleotide, one with ADP bound, and one 
with ATP bound. Binding of nucleotides or polypeptides 
alters the conformations of both the nucleotide- and 
polypeptide-binding domains, further indication that the 
conformations of these two domains are highly coupled 
(71). 

Recently, a new pair of DnaK/DnaJ-like chaperones,
HSC66 and HSC20, has been discovered in E. coli (242).
Sequence differences between HSC66 and HSC20 compared with other HSP70/
HSP40 members suggest that these chaperones may have 
different peptide binding specificity and be subject to 
different regulatory mechanisms. In particular, the high 
level of constitutive expression and lack of significant 
response to temperature changes suggest that HSC66 and 
HSC20 may play an important role in the folding of certain 
newly synthesized proteins under normal cellular condi- 
tions. 

Details of the mechanism by which HSP70 interact
with newly synthesized and nonnative proteins are given
in sections IV and VA.

E. HSP90 Family 

Members of the HSP90 family are highly conserved, 
essential proteins found in all organisms from bacteria to 
humans. Examples include the cytosolic form in eu- 
karyotes, HSP90, the ER form, Grp94, and the E. coli 
homolog HtpG. Mammalian HSP90 exist as dimers. Al- 
though there are a number of similarities between the 
activities of HSP90 and HSP70, the former has several 
identified specific interactions, for example, with cy- 
toskeleton elements, signal transduction proteins (includ- 
ing steroid hormone receptors), and protein kinases (such 
as the mitogen-activated protein kinase system). HSP90 is 
frequently found in complexes with other chaperones. In 
vitro, HSP90 exhibits chaperone activity with diverse pro- 
teins, suggesting a general function. The properties of 
HSP90 have been reviewed (10, 11, 20, 113, 175, 265). 

Recently, the crystal structure of the NH2-terminal
domain of the yeast HSP90 was solved to reveal a dimeric
structure based on a highly twisted 16-stranded β-sheet.
The opposing faces of the β-sheet in the dimer define a
potential peptide-binding cleft, suggesting that the N do- 
main may serve as a molecular "clamp" in the binding of 
ligand proteins to HSP90 (179). 

There has been a long-standing controversy as to 
whether HSP90 binds or hydrolyzes ATP. The crystal 
structures of complexes between the NH2-terminal domain
of the yeast HSP90 with ADP/ATP unambiguously
show a specific adenine nucleotide binding site, homolo- 
gous to the ATP-binding site of DNA gyrase B. This site is 
the same as that identified for binding the antitumor agent 
geldanamycin, suggesting that geldanamycin acts by 
blocking the binding of nucleotides to HSP90 and not the 
binding of incompletely folded substrate proteins as pre- 
viously suggested. These results strongly suggest the di- 
rect involvement of ATP in the function of HSP90 (82, 
178). 

Even though HSP90 is one of the most abundant 
chaperones in the cell, its in vivo functions are poorly 
understood, and little is currently known about its role in 
chaperoning the folding of newly synthesized proteins, 
although there are hints that it does not function alone but 
is associated with several other cofactors. For example, 
HSP90 performs at least part of its function in a complex 
with members of the prolyl isomerase family, FKBP52 and 
p23 (10), and the steroid receptor complex consists of 
HSP90, HSP70, p48, the cyclophilin Cyp-40, and the asso- 
ciated proteins p23 and p60 (45). Although neither Cyp-40 
nor p23 can refold unfolded substrates, in in vitro folding 
experiments they interact with nonnative proteins and 
maintain a folding-competent intermediate (67). 



A temperature-sensitive mutant of HSP90 in yeast, 
which rapidly and completely loses activity on shift to 
high temperatures, has been used to examine the func- 
tions of HSP90 in vivo. The results suggested that HSP90 
is not required for the de novo folding of most proteins 
but is required for a specific subset of proteins that have 
greater difficulty reaching their native conformations 
(153). In vitro, in the absence of nucleotide, HSP90 can 
maintain nonnative substrate in a "folding-competent" 
state that refolds upon addition of HSP70, DnaJ homolog, 
and nucleotide (66). 

F. HSP100 Family 

The heat-inducible members of the HSP100 (or Clp) 
family of proteins have a number of very intriguing prop- 
erties and share a common function in helping organisms 
to survive extreme stress (78). They perform a diverse set 
of functions, including proteolysis. They are highly con- 
served, present in all organisms, and contain ATP and 
polypeptide binding sites. Both HSP104 and ClpA form 
six-membered ring complexes; the diameter of the inte- 
rior of the rings is much smaller than in GroEL, making it 
unlikely that the HSP100 function analogously to HSP60. 
The basic mechanisms by which these chaperones func- 
tion are not understood. There is some suggestion that 
HSP104 may act in concert with HSP70 and DnaJ ho- 
mologs to increase the yields of renatured protein (78). It 
should be noted that no human analogs of HSP104 have 
been found. 

Unlike HSP60 and HSP70, which are unable to re- 
solubilize aggregated proteins in vitro (with the exception 
of RNA polymerase), HSP104 has been observed to solu- 
bilize thermally aggregated proteins both in vivo and in 
vitro (170). Interestingly, ClpA can substitute for the ATP- 
dependent chaperone function of DnaK and DnaJ in the in 
vitro activation of the plasmid P1 RepA replication initiator
protein (257). Another unusual feature of HSP104 is
its role in triggering a prionlike disorder in yeast, involv- 
ing the extrachromosomal elements PSI+ and URE3 (37). 

G. Calnexin and Calreticulin 

Calnexin is a transmembrane molecular chaperone 
that resides in the ER. Calreticulin, which has sequence 
homology with calnexin, is a soluble ER chaperone. Both 
proteins are involved in the folding and assembly of nas- 
cent proteins in the ER in a calcium-dependent manner 
and play an important role in glycoprotein maturation and 
quality control in the ER (6, 93, 119, 259). 

Most proteins that enter the ER are cotranslationally 
modified by the addition of a complex carbohydrate struc- 
ture that undergoes subsequent modification by selective 
removal of individual hexose residues (219). Both calreticulin
and calnexin transiently interact with many newly
synthesized proteins in the ER, with some overlap be- 
tween those proteins that bind to calreticulin and those 
that bind to calnexin. The specificity of the interaction is 
determined by the nature of the oligosaccharide and re- 
quires the trimming of glucose residues from the aspar- 
agine-linked core glycans by glucosidases. Calnexin tran- 
siently interacts with newly synthesized glycoproteins, 
specifically recognizing a monoglucosylated intermediate 
(162). Its major role appears to be to monitor glycoprotein 
folding and prevent incompletely folded proteins from 
leaving the ER. As proposed by Helenius and co-workers 
(92), carbohydrate processing and folding occur simulta- 
neously; calnexin recognizes and binds the monoglucosy- 
lated glycoprotein intermediate of the nascent chain. Af- 
ter the remaining glucose is removed, the glycoprotein is 
released from calnexin; if it is incompletely folded, it is 
reglycosylated and rebinds to calnexin. If it is folded it no 
longer binds to calnexin. Calreticulin is also specific for 
monoglucosylated glycans (172). The interactions of cal- 
reticulin and calnexin with denatured proteins are highly 
dependent on divalent metal ions or polyamines (261). 
Calnexin facilitates the folding and assembly of class I 
histocompatibility molecules and prevents formation of 
aggregates, showing that it functions as a molecular chaperone
(241). Protein folding in the ER is also discussed in
section IVA.
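
The bind, release, and rebind cycle proposed for calnexin can be traced as a simple loop. The Python sketch below is a schematic of the sequence of events described above and nothing more; the parameter `folds_on_round` (the binding round during which the protein happens to reach its folded state) and the cap `max_rounds` are hypothetical devices introduced only for illustration.

```python
def calnexin_cycle(folds_on_round, max_rounds=10):
    """List, in order, the events a glycoprotein passes through.

    folds_on_round: hypothetical 1-based round in which folding completes.
    max_rounds: arbitrary cap so that the trace always terminates.
    """
    events = []
    for round_no in range(1, max_rounds + 1):
        events.append("bind calnexin (monoglucosylated intermediate)")
        events.append("remaining glucose removed; released from calnexin")
        if round_no >= folds_on_round:
            events.append("folded: no longer binds calnexin")
            return events
        events.append("incompletely folded: reglycosylated")
    events.append("still unfolded after max_rounds")
    return events
```

A protein that folds during its first round yields three events, and each additional round of rebinding adds three more.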



H. Protein Disulfide Isomerase 

Protein disulfide isomerase (PDI) is a critical cofac- 
tor in the folding of many proteins that are found in the 
ER (65, 77, 143, 176). Many secreted proteins have multi- 
ple disulfide bonds, presenting potential problems for 
correct disulfide pairing during folding. In vitro studies of 
the refolding of reduced proteins show that disulfide bond 
formation occurs rapidly and is followed much more 
slowly by thiol-disulfide rearrangement leading to the cor- 
rect disulfide pairings. Thus catalysis of oxidative folding 
is necessary in vivo to rapidly generate the correct disul- 
fide bonds in newly synthesized proteins. In the eukary- 
otic ER, PDI fulfills this function. Its concentration can 
reach close to millimolar levels. The properties of PDI 
have recently been reviewed (245). In addition to strong 
affinity for unfolded proteins and peptides, it binds many 
relatively hydrophobic molecules such as steroid and thy- 
roid hormones. Hence, it is not surprising that PDI has 
been reported to have chaperone-like activity at high 
concentrations (such as inhibition of aggregation) distinct 
from its disulfide bond interactions (22, 180, 209). 

Protein disulfide isomerase has two catalytic sites 
situated in two domains homologous to thioredoxin, one 
near the NH2 terminus and the other near the COOH
terminus. The thioredoxin domains, by themselves, can
catalyze disulfide formation, but they are unable to catalyze
disulfide isomerizations (36).

I. Peptidyl Prolyl Isomerase/Trigger Factor 

Under in vitro (and presumably in vivo) conditions, 
proline cis-trans isomerization may become rate limiting 
in the folding of proteins; in many cases, the presence of 
peptidyl prolyl isomerase (PPI) will enhance the rate of 
folding. Peptidyl prolyl isomerases are ubiquitous en- 
zymes found in virtually all organisms and subcellular 
compartments. Three unrelated families are known: the 
cyclophilins, the FK506-binding proteins (FKBP), and the 
parvulins (200). The former two families are also known 
as immunophilins. The trigger factor is a PPI with some- 
what similar activity, and weak homology, to FKBP. Trig- 
ger factor is an abundant cytosolic protein originally iden- 
tified by its ability to maintain the precursor of a secretory 
protein in a translocation-competent form (31). 

Structural studies of the E. coli trigger factor reveal a 
modular structure, composed of three stably folded do- 
mains, of which the catalytic one is homologous to FKBP 
(270). Trigger factor binds partially folded intermediates 
tightly. Although the isolated catalytic domain of the trig- 
ger factor retains full prolyl isomerase activity toward 
short peptides, its activity toward protein substrates is 
dramatically reduced, indicating that the polypeptide 
binding site extends beyond the FKBP domain (201). 

Trigger factor has several chaperone-like functions: it 
binds to nascent cytosolic and secretory polypeptide 
chains, and it catalyzes protein folding in vitro (98). Trig- 
ger factor interacts with GroEL in vivo and promotes its 
binding to at least some polypeptides; GroEL-trigger fac- 
tor complexes show much greater affinity for partially 
folded intermediates than GroEL alone (116). On the basis 
of studies showing that trigger factor was cross-linked to 
all tested nascent chains derived from both secreted and 
cytosolic proteins, it appears that trigger factor may act as 
a general molecular chaperone in protein synthesis (99, 
240). 

J. HSP70 Cochaperones 

In addition to DnaJ and GrpE, which function as 
cochaperones with DnaK, and have been known for sev- 
eral years, other protein cofactors that interact with 
HSP70 have been discovered recently. These include Hip 
(HSC70-interacting protein), BAG-1, and auxilin. The ex- 
istence of these cofactors illustrates the complexity of the 
HSP70 chaperone machinery in cells. 

GrpE is a key component of the HSP70 chaperone 
system for protein folding in bacteria and mitochondria. 
GrpE acts as a nucleotide exchange factor to control the 
ATPase activity of DnaK in its reaction cycle, although the
details of its mechanism remain unclear. GrpE has high
affinity for monomeric native DnaK, as well as the isolated
ATPase domain (185, 202). GrpE has no affinity for
ATP or ADP, nor for the oligomeric states of DnaK. The
nucleotide exchange properties of GrpE are a conse- 
quence of the binding of GrpE to DnaK, leading to a 
conformational change involving the opening of the nu- 
cleotide cleft on DnaK, resulting in a low-affinity state for 
nucleotides. Recently, the crystal structure of GrpE 
bound to the ATPase domain of the molecular chaperone 
DnaK has been determined (89). A dimer of GrpE binds 
asymmetrically to a single molecule of DnaK. The struc- 
ture of the nucleotide-free ATPase domain complexed 
with GrpE closely resembles that of the nucleotide-bound 
mammalian HSP70 homolog, except for an outward rota- 
tion of one of the subdomains of the protein. Two long 
α-helices extend away from the GrpE dimer and suggest
an additional role for GrpE in peptide release from DnaK. 
The functional aspects of GrpE are given in section VA.

Hip is a novel tetrameric cochaperone involved in the 
regulation of eukaryotic HSC70, distinct from that of bac- 
terial HSP70. It appears to play a role in forming stable 
HSP70 complexes with substrate proteins. One Hip oli- 
gomer binds the ATPase domains of at least two HSC70 
molecules, dependent on activation of the HSC70 ATPase 
by HSP40. Although hydrolysis remains the rate-limiting 
step in the ATPase cycle, Hip stabilizes the ADP state of 
HSC70 that has a high affinity for substrate protein. Hip 
also appears to be a chaperone in its own right, in that it 
binds to some unfolded proteins (15, 104). 

BAG-1 is a recently discovered regulator of HSP70 
(216, 271). BAG-1 is an antiapoptotic protein and also 
interacts with several steroid hormone receptors that re- 
quire the molecular chaperones HSP70 and HSP90 for 
activation. The action of BAG-1 is similar to that of GrpE 
in bacterial cells, in that it binds to the ATPase domain of 
HSP70 and, in cooperation with HSP40, stimulates the 
rate of ATP hydrolysis by increasing the rate of release of 
ADP from HSP70 (103). BAG-1 can be coimmunoprecipi- 
tated with HSP70 from cell lysates (223). BAG-1 inhibited 
the HSP70-mediated in vitro refolding of an unfolded 
protein substrate. The binding of BAG-1 to one of its 
known cellular targets, Bcl-2, in cell lysates was found to 
be dependent on ATP, consistent with the possible in- 
volvement of HSP70 in complex formation. The identifi- 
cation of HSP70 as a partner protein for BAG-1 may 
explain the diverse interactions observed between BAG-1 
and several other proteins, including steroid hormone 
receptors and certain tyrosine kinase growth factor re- 
ceptors. 

Auxilin is a 100-kDa cofactor involved in the HSP70- 
mediated uncoating of clathrin-coated vesicles (239). 
Clathrin-coated vesicles transport selected integral mem- 
brane proteins from the cell surface and the trans-Golgi 
network to the endosomal system. Before fusing with
their target, the vesicles must be stripped of their coats. 
Auxilin binds with high affinity to assembled clathrin 
lattices and, in the presence of ATP, recruits HSP70. The 
presence of a J domain at its COOH terminus indicates 
that auxilin is a member of the DnaJ family: deletion of 
the J domain results in the loss of cofactor activity. 

A 16-kDa cytosolic protein, called p16, which copurifies
with HSC70 from fish liver, has been identified as a
member of the Nm23/nucleoside diphosphate kinase family
(135). p16 may modulate HSC70 function by maintaining
HSC70 in a monomeric state and by dissociating unfolded
proteins from HSC70 either through protein-protein
interactions or by supplying ATP indirectly
through phosphate transfer.

Hop is a recently discovered 60-kDa protein that can 
form a physical link between HSP70 and HSP90, thus 
modulating their activities (114). Hop is involved in the 
refolding of denatured protein in rabbit reticulocyte lysate
and stimulates the refolding by HSP70 and YDJ-1 in a
purified refolding system. Optimal refolding was observed 
in the presence of both Hop and HSP90. Hop preferen- 
tially formed a complex with ADP-bound HSP70 and also 
appears to bind to the ADP-bound form of HSP90. 

K. Specialized Chaperones 

Some molecular chaperones may be highly specific in 
that they interact with only one, or a very limited number, 
of target proteins; examples are PapD (127), which is 
involved in the assembly of bacterial pili, and HSP47, 
which is involved in the folding and processing of procol- 
lagen in the ER. There are many large and complex pro- 
tein machines in cells: in some of these cases, specific 
molecular chaperones are involved in their assembly 
(206). Some of the best-studied systems are bacteriophage
capsids and bacterial pili and flagella.

The 47-kDa HSP (HSP47) is an ER-resident chaper- 
one found in collagen-producing cells, where it interacts 
with procollagen. It has been proposed that it functions as 
a chaperone regulating procollagen chain folding and/or 
assembly, but the mechanism is not well understood 
(152). It is likely that its main function is to prevent 
aggregation and misfolding of newly synthesized procol- 
lagen chains until the correct COOH-terminal associa- 
tions have been made to yield the collagen triple helix. 
When HSP47-procollagen complexes reach the cis-Golgi
network, the chaperone rapidly dissociates. The major 
interaction site on procollagen has been shown to be the 
pro-alpha 1 N-propeptide (109). 

Receptor-associated protein (RAP) is another exam- 
ple of a specialized molecular chaperone, in this case for 
the low-density lipoprotein receptor-related protein 
(LRP), a large receptor that binds multiple ligands. The 
major role of RAP is to facilitate correct folding of LRP 
and to prevent the premature interaction of ligands with
LRP (161). 

The production of native α/β-tubulin heterodimer depends
on the action of cytosolic chaperonin and at least
five protein cofactors. These reactions do not depend on
ATP hydrolysis (230, 231). The β-tubulin monomer release
factor, p14, which catalyzes the release of β-tubulin
monomers from intermediate complexes, has recently
been shown to be a member of the DnaJ family (141).

There are also a number of chaperones involved in 
protein export, such as SecB from E. coli (88). SecB has 
two functions: it maintains precursors of some exported 
proteins in a conformation compatible with export, by 
preventing them from aggregating or from folding to their 
native state in the cytoplasm, and it delivers both nascent 
and completed precursors to SecA, one of the compo- 
nents of the export apparatus associated with the plasma 
membrane. Only those polypeptides that fold slowly in- 
teract significantly with SecB, even though it is able to 
bind a wide variety of nonnative proteins. Complexes
between SecB and substrate proteins are in rapid equilib- 
rium with the free states (236). Thus, unlike the HSP70 
and HSP60, in which hydrolysis of ATP is coupled to the 
binding and release of substrate proteins, SecB does not 
form stable complexes with substrate proteins. This may 
reflect the fact that SecB does not mediate protein folding 
but is specialized for the protein export pathway. 

IV. INTERACTIONS OF NASCENT CHAINS 
WITH CHAPERONES 

Fundamental questions in protein biogenesis include 
at which stage the nascent protein first interacts with 
molecular chaperones, the identity of the chaperones, and 
the role of the chaperones in facilitating protein folding. 
Do they just prevent aggregation and misfolding, or do 
they play a more active role in the actual folding process? 
The involvement of chaperones in both co- and posttrans- 
lational folding is now clear. 

Considerable controversy continues regarding which 
chaperones are involved in interactions with nascent 
polypeptide chains. The initial evidence for chaperoning 
came from studies on the assembly of immunoglobulin 
light and heavy chains and the involvement of the protein 
now known as BiP (an HSP70) (85, 151). In a study with 
major implications for in vivo protein folding and assem- 
bly, Welch and co-workers (5) demonstrated that cytoso- 
lic forms of HSP70 bind cotranslationally to nascent 
polypeptide chains and to newly synthesized proteins in 
the normal (unstressed) cell in an ATP-dependent man- 
ner. The association of cytosolic HSP70 with nascent 
polypeptide in translating ribosomes has subsequently 
been confirmed in a number of organisms (154). 

There have been several reports that a high-molecular-weight
complex of proteins including various chaperones
is associated with nascent (or unfolded) polypeptide
chains during chain elongation in vitro and in vivo. Early 
evidence was observed in the renaturation of firefly lucif- 
erase in cell-free translation systems (70, 97, 205). Chap- 
erone-stabilized luciferase was associated with high-mo- 
lecular-weight complexes overlapping the distributions of 
HSP70, HSP90, and the chaperonin TRiC on gel filtration
columns (160). Molecular chaperones that have been implicated
include HSP70 (5, 86); HSP70 and HSP40 (123,
154); HSP70, HSP40, and the TCP-1 ring complex chaperonin
(70); and HSP70 and HSP90 (46, 205).

In a clever new approach, an antibody to puromycin 
was used to identify a population of truncated nascent 
polypeptides that were then probed by immunoprecipita- 
tion and chemical cross-linking with several antibodies 
that recognize the cytosolic chaperones HSP70, CCT 
(TRiC), HSP40, p48 (Hip), and HSP90, as a means of 
identifying chaperones bound to the nascent chains (46). 
The results showed that HSP70 is the predominant chap- 
erone bound to nascent polypeptides. The interaction 
between HSP70 and nascent polypeptides is apparently 
dynamic under physiological conditions but can be stabi- 
lized by depletion of ATP or by chemical cross-linking. 
Interestingly, the cytosolic chaperonin CCT (TRiC) was 
found to bind primarily to full-length, newly synthesized 
actin and tubulin. Other studies have also implicated the 
TCP-1 ring complex in the synthesis and assembly of 
tubulin and actin (69, 215, 264). 

This investigation also demonstrated that nascent 
polypeptides have a strong propensity to bind to many 
proteins nonspecifically in cell lysates. It is likely that this 
nonspecific binding is responsible for the reports of ad- 
ditional components in contact with nascent polypeptides 
(46). 

Several studies provide support for cotranslational 
interactions of molecular chaperones with nascent 
polypeptide chains, especially HSP70 and perhaps HSP40. 
The interaction of DnaJ with nascent ribosome-bound 
polypeptide chains as short as 55 residues was reported 
using firefly luciferase and chloramphenicol acetyltrans- 
ferase in cross-linking experiments (97). These investiga- 
tions showed that both folding and subsequent mitochon- 
drial translocation required DnaK, DnaJ, and GrpE and 
led to the proposal that DnaJ protects nascent polypep- 
tide chains from aggregation and, in cooperation with 
HSP70, controls their productive folding once a complete 
polypeptide or a polypeptide domain has been synthe- 
sized. Both HSP70 and HSP40 were shown to be associ- 
ated with nascent polypeptide chains in translating ribo- 
somes, whereas GroEL, although transiently associated 
with newly synthesized proteins, was absent from the 
ribosomes, suggesting that HSP70 and HSP40 play an 
early role in protein folding, whereas GroEL acts at a later 
stage (70, 72, 173). 

Investigations using fluorescent-labeled rhodanese
(by the cotranslational incorporation of a coumarin derivative
at the NH2 terminus of the nascent protein) demonstrated
the accumulation of full-length but enzymatically
inactive polypeptides on the ribosomes. These polypep- 
tides could be activated and released by subsequent in- 
cubation with the chaperones DnaJ, DnaK, GrpE, GroEL, 
GroES, and ATP and release factor. Changes in fluores- 
cence indicated that DnaJ bound to the nascent protein 
and appeared to be essential for folding of ribosome- 
bound rhodanese into the native conformation (87, 124, 
126). 

Further support that folding of nascent proteins can 
take place on the ribosome comes from studies on partially
folded intermediate states of bacteriophage P22 tailspike
protein and the β-subunit of tryptophan synthase,
which can be detected while still bound to ribosomes 
using monoclonal antibodies to the intermediates. The 
rapid appearance of the intermediates suggests that the 
nascent chains start folding during their elongation on the 
ribosomes. The newly synthesized incomplete chains 
were shown to interact with DnaK but not GroEL while 
still bound to the ribosome (235). 

There has been considerable discussion as to 
whether GroEL interacts with newly synthesized proteins 
in a cotranslational or posttranslational manner. As noted 
above, several studies indicated that DnaK and DnaJ are 
involved at an early stage in the folding of newly synthe- 
sized protein and that GroEL acts at a later stage (72). In 
a recent investigation in which rhodanese was synthesized
in both bacterial and wheat germ translation
extracts, only posttranslational stable complexes with 
GroEL were found (184). Further evidence consistent 
with the HSP70 chaperone machinery interacting with 
newly synthesized proteins before GroEL (or concur- 
rently) comes from investigations on the synthesis of 
chloramphenicol acetyltransferase in a system genetically 
depleted of DnaK and DnaJ. Most of the chloramphenicol 
acetyltransferase failed to assemble into active trimers 
and accumulated either in a complex with GroEL or as 
inactive monomer. The addition of DnaK and DnaJ to the 
system before the start of protein synthesis led to in- 
creased formation of native chloramphenicol acetyltrans- 
ferase (244). Another investigation supporting a place for 
GroEL in the later stages of the folding of newly synthe- 
sized proteins made use of temperature-sensitive lethal 
mutations in the GroEL gene. After a shift to a nonper- 
missive temperature, the rate of general translation in the 
mutant cells was reduced, but a specific group of cyto- 
plasmic proteins failed to fold to their native states (107). 
The much more limited specificity demonstrated for 
TCP-1 chaperonins, compared with GroEL, suggests sig- 
nificantly different roles for these two classes of chaper- 
onins in the biosynthesis of proteins. It is likely that the
functional differences reflect underlying structural differ- 
ences. 

Bukau and co-workers (16) have recently suggested 
that during the folding of newly synthesized proteins, 
DnaK and GroEL do not act in sequence, but rather the 
two chaperone systems form a "lateral network of coop- 
erating proteins." There are data to support both this and 
the sequential models, so the question remains unre- 
solved at present. Another source of controversy relates 
to the question of how many newly synthesized proteins 
require the assistance of HSP60 (chaperonins) in folding. 
Lorimer has calculated that for E. coli there is only sufficient
GroEL to assist ~5% of newly translated proteins
under normal conditions (142). It is therefore likely that 
most newly synthesized proteins in E. coli fold without 
the assistance of GroEL, and this implies that most pro- 
teins fold fast enough that sequestration on DnaK to 
minimize the concentration of nonchaperone-bound pro- 
tein suffices to prevent aggregation. 

Thus, whereas the overall outline of the process of 
chaperone-mediated folding of newly synthesized pro- 
teins is clear, the details are as yet incompletely resolved. 
A nascent polypeptide will interact with HSP70 and pos- 
sibly other chaperones (probably HSP40) as it emerges 
from the ribosome. The lifetime of HSP70 complexes with 
substrate proteins under in vivo conditions is not well 
established but is likely to be comparable to the time for 
folding of many newly synthesized proteins. Dissociation 
of the newly synthesized chain from HSP70 after release 
of the nascent chain from the ribosome sets up a kinetic 
competition between rebinding to HSP70, binding to 
HSP60, spontaneous folding, aggregation, or possibly
even proteolysis (Fig. 3). 
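
The kinetic competition described above can be illustrated with a minimal sketch. It assumes that each fate of the released chain can be approximated as a first-order (or pseudo-first-order) process, so that the fraction of chains following a given route is that route's rate constant divided by the sum of all competing rate constants. The rate constants below are hypothetical placeholders chosen only for illustration; they are not measured values, and aggregation in particular is concentration dependent and is treated here as pseudo-first-order for simplicity.

```python
def partition_fractions(rate_constants):
    """Return the fraction of molecules taking each competing first-order route."""
    total = sum(rate_constants.values())
    if total <= 0:
        raise ValueError("at least one rate constant must be positive")
    return {fate: k / total for fate, k in rate_constants.items()}


# Hypothetical pseudo-first-order rate constants (s^-1); illustrative only.
example_rates = {
    "rebind_HSP70": 0.5,
    "bind_HSP60": 0.1,
    "spontaneous_folding": 0.3,
    "aggregation": 0.05,
    "proteolysis": 0.05,
}

fractions = partition_fractions(example_rates)
for fate, fraction in fractions.items():
    print(f"{fate}: {fraction:.2f}")
```

Under this simplification, raising the folding rate constant relative to the others increases the share of chains that fold without further chaperone assistance, which is the qualitative point made in the text.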

It has also been reported that there may be significant 
differences between folding in prokaryotes and eu- 
karyotes (155). In a eukaryotic translation system, two- 
domain engineered polypeptides were observed to fold by 
sequential and cotranslational folding of their domains. 
However, in E. coli, folding of the same proteins was 
found to be posttranslational and to lead to intramolecu- 
lar misfolding of the concurrently folding domains (155). 
In addition, differences between the in vitro and in vivo 
nature of the interactions of chaperones with actin during 
refolding from denaturant have been reported (68). 

A. Folding in the Endoplasmic Reticulum 

The ER is a key compartment in cells that are spe- 
cialized for protein export and contains many chaperones 
that are essential for the production of functional proteins 
for export (250). Folding begins with the insertion of a 
preprotein into the lumen of the ER and can occur either 
posttranslationally, in which case the preprotein is com- 
pletely synthesized on cytosolic ribosomes before being 
translocated, or cotranslationally, in which case membrane-associated
ribosomes direct the nascent polypeptide
chain into the ER concomitant with polypeptide elongation
(14). The ER has excellent quality control
mechanisms (involving chaperones) that recognize and 
selectively retain misfolded proteins, which are then ei- 
ther degraded or refolded (30, 149). The concentration of 
the ER HSP70, BiP, is increased by elevated levels of 
misfolded proteins in the ER. How the levels of misfolded 
molecules are monitored and how this information is used 
to regulate the synthesis of BiP are still poorly understood.
Likewise, the mechanisms by which the oxidizing potential
of the ER environment is regulated, and the misfolded
proteins are degraded, are also unknown.

Although some of the major chaperones involved in 
protein folding in the ER are well studied, e.g., BiP and 
PDI, it is apparent that more have yet to be characterized. 
For example, several calcium-dependent putative chaper- 
ones have recently been identified using affinity chroma- 
tography with denatured-protein columns and elution 
with ATP (159). These proteins were identified as BiP 
(grp78), HSP90 (grp94), calreticulin, a novel 46-kDa pro- 
tein that binds azido-ATP, as well as three members of the 
thioredoxin superfamily: PDI, ERp72, and a previously 
reported 50-kDa protein (p50). Because the release of 
HSP90, PDI, ERp72, calreticulin, and p50 was stimulated 
by Ca2+, these proteins appear to function as Ca2+-dependent
chaperones (159).

Evidence is accumulating that the ER HSP70 chaper- 
one machinery is similar to that in the cytosol and bacte- 
ria, in that at least two DnaJ homologs have been found in 
the ER. For example, a yeast DnaJ homolog, Scj1p, is
located in the lumen of the ER where it can interact with 
Kar2p (the HSP70 of the yeast ER) via the conserved J 
domain (198). Undoubtedly, chaperone-mediated folding 
in the lumen of the ER is complex, as revealed by the 
observation that the interaction of BiP with immunoglobulin
light chains during folding suggests that light chains
undergo both BiP-dependent and BiP-independent folding 
steps and that BiP must release the light chains before 
disulfide bond formation can occur in them (94). 

B. Mitochondrial Import/Folding 

Molecular chaperones play a critical role in targeting 
proteins to the mitochondria and the subsequent folding 
of the imported protein. In support of the endosymbiont 
theory on the origin of mitochondria, the chaperones of 
the mitochondria show a high degree of similarity to 
bacterial molecular chaperones, including a GrpE ho- 
molog (mGrpE) (193). The mitochondrial HSP70 
(mHSP70) mediates protein transport across the inner 
membrane and protein folding in the matrix. These two 
reactions are carried out by two different mHSP70 complexes.
The ADP-bound form of mHSP70 favors formation
of a complex on the inner membrane; this "import com- 
plex" contains mHSP70, its membrane anchor Tim44, and 
mGrpE (106). The ATP-bound form of mHSP70 favors 
formation of a complex in the matrix; this "folding com- 
plex" contains mHSP70, the mitochondrial DnaJ homolog 
Mdj1, and mGrpE. A more detailed discussion of the role
of chaperones in mitochondrial import and folding can be 
found in recent reviews (106, 156, 193). 

V. MECHANISMS OF CHAPERONE FUNCTION

Considerable effort has been expended over the past 
few years to understand the mechanistic details of chap- 
erone function. Great progress has been made, although 
considerable further study is necessary. The two best 
understood systems are those of HSP70 and GroEL. Even 
with these, the complexity of the systems, especially due 
to the interactions with cochaperones and other cofac- 
tors, has often led to apparently conflicting hypotheses. 
An additional source of potential discrepancies in behav- 
ior of the chaperones results from the effects of low 
concentrations of critical contaminants; for example, it 
has recently been shown that samples of HSP70 and 
HSP90 are often contaminated with low levels of DnaJ or 
HSP40, which may profoundly affect the experimental 
observations (204). 

It is convenient to consider the mechanism of action 
of both HSP70 and GroEL in terms of their reaction 
cycles. Both of these chaperones require cochaperones 
for their full function, GroES in the case of GroEL, and 
HSP40 (or DnaJ) in the case of HSP70. Several theoretical 
models have been proposed to account for the effects of 
chaperonins on protein folding (25, 207, 232). 

A. HSP70 Reaction Cycle 

Several models have been proposed for the reaction 
cycle of HSP70 (4, 16, 74, 90, 146, 147, 166, 174, 195, 221). 
The DnaK cycle has been the most studied and is consid- 
ered here. The reaction cycles for other HSP70 appear to 
be similar, with the exception that the cofactor GrpE will 
only be present in bacteria and mitochondria (273). 

Although the general features of the HSP70 reaction 
cycle are established, there is considerable discussion 
about the details. Many observations indicate that the 
maximal functional effect of HSP70 requires the presence 
of DnaJ (or its homologs) (and GrpE in the case of 
prokaryotes and mitochondria) (73, 102, 136, 203, 217, 
221, 249, 258, 274). The reports that DnaJ or HSP40 may 
bind at least some unfolded substrates are another source 
of confusion. It is now well established that GrpE and its 
homologs are nucleotide exchange factors and stimulate 
the ATPase cycle of DnaK or mHSP70 by increasing the 
rate of ADP release (39, 221) and that DnaJ and its homologs
function to increase the rate of hydrolysis of
HSP70-bound ATP (146, 221). Several studies have sug- 
gested that the action of DnaJ and GrpE with DnaK 
requires substoichiometric levels of the two cofactors 
(174). This is also consistent with the physiological molar 
ratios, in which DnaK is in large excess. 

It has been shown that HSP70 discriminates between 
folded and unfolded proteins, normally binding only the 
latter (168). In fact, it is likely that HSP70 can distinguish 
between relatively unfolded intermediates and strongly 
nativelike intermediates and binds only the former. The 
fact that HSP70 binds to certain proteins in their native 
state, e.g., clathrin, is assumed to arise from the presence 
of accessible, unfolded loops. For a given unfolded sub- 
strate protein, there will be several potential HSP70 bind- 
ing sites along the polypeptide chain, of different affinity 
for the chaperones (192). Both the conformational state 
of the substrate protein bound to HSP70 and the confor- 
mation of the substrate protein on ATP-induced release 
have been shown to be substantially unfolded (167). 

The nature of the bound nucleotide affects the con- 
formation of the chaperone and particularly its affinity for 
substrate protein. Thus complexes with ATP have low 
affinity for substrate and those with bound ADP have high 
affinity (166, 167, 174). The high affinity of HSP70 for 
nucleotides means that these chaperones will be found as 
binary complexes with ATP and ADP in the cell. Although 
the ATP complex binds substrate proteins/peptides much 
more rapidly than the ADP complex (146, 199, 224), the 
resulting ternary complex, HSP70-ATP-substrate, also re- 
leases the substrate protein very rapidly, and thus no 
productive complexes with unfolded substrate result
(166). In contrast, the HSP70-ADP complex, although
binding substrate protein at a slower rate, forms a relatively
stable ternary complex, HSP70-ADP-substrate
(167). The formation of small amounts of substrate complex
when HSP70-ATP and substrate protein are mixed
arises from ATP hydrolysis occurring during the reaction 
(stimulated by the presence of the substrate protein). 
Thus the formation of relatively long-lived complexes 
between unfolded proteins and HSP70 requires the pres- 
ence of the HSP70-ADP-substrate complex. This explains, 
at least in part, the need for DnaJ and its homologs, since 
DnaJ significantly stimulates the rate of hydrolysis of ATP 
bound by HSP70, thus leading to formation of HSP70-ADP 
(18, 146). 
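
The relationship between binding rates and affinity in the two nucleotide states can be sketched numerically. The example below uses the standard relations K_d = k_off / k_on and t(1/2) = ln 2 / k_off for a simple one-step binding equilibrium. The rate constants are hypothetical values chosen only to mirror the qualitative pattern described in the text (fast binding and fast release for the ATP state, slow binding and much slower release for the ADP state); they are not taken from the cited studies.

```python
import math


def dissociation_constant(k_on, k_off):
    """K_d (M) from association (M^-1 s^-1) and dissociation (s^-1) rate constants."""
    return k_off / k_on


def complex_half_life(k_off):
    """Half-life (s) of a complex that decays by first-order dissociation."""
    return math.log(2) / k_off


# Hypothetical rate constants; illustrative only, not measured values.
states = {
    "HSP70-ATP": {"k_on": 1.0e6, "k_off": 10.0},
    "HSP70-ADP": {"k_on": 1.0e4, "k_off": 0.001},
}

for name, k in states.items():
    kd = dissociation_constant(k["k_on"], k["k_off"])
    t_half = complex_half_life(k["k_off"])
    print(f"{name}: K_d = {kd:.1e} M, half-life = {t_half:.3g} s")
```

With these assumed numbers the ADP state has the lower K_d (higher affinity) and the longer-lived complex even though it binds substrate more slowly, which is why conversion to HSP70-ADP is needed for a relatively stable complex.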

The release of the substrate protein from the HSP70- 
ADP-substrate complex is triggered by the binding of 
ATP, which induces a conformational change in the pep- 
tide-binding domain (17, 84, 137, 165, 166, 227). On the 
basis of the crystallographic structure of the peptide- 
binding domain, the conformational change presumably 
involves the raising of the flap or latch, which is hypothesized
to help maintain the substrate peptide bound (272).
The cycle is completed by rebinding of another substrate 
protein molecule to the HSP70-ATP complex, or the hy- 
drolysis of the ATP, leading to formation of DnaK-ADP 
and another conformational change. That substrate pro- 
tein dissociation precedes ATP hydrolysis was demon- 
strated by comparison of the corresponding rates, the rate 
for ATP hydrolysis being significantly slower than that for 
substrate dissociation (166). 

There appear to be several potential pathways for 
substrate proteins to enter the DnaK reaction cycle: via 
binding to DnaK-ATP, to DnaK-ATP-DnaJ, to DnaJ (which 
then binds to DnaK-ATP), to DnaK-ADP, and possibly to 
DnaK-ADP-DnaJ. The concentrations of the two ternary 
complexes are expected to be quite low so these are 
probably not major entry points. The same goes for DnaK- 
ADP under normal (nonstress) conditions when the levels 
of DnaK-ATP greatly exceed those of DnaK-ADP. The 
majority of the data suggest that the DnaK-ATP complex 
will normally be the main portal for entry to the cycle. 

Most of the proposed reaction cycles of HSP70 fall 
into two broad classes: 1) those which propose that it is
only DnaK (HSP70) with bound ATP which interacts with 
the unfolded substrate, and that the interaction of this 
ternary complex with DnaJ (HSP40) leads to rapid ATP 
hydrolysis (146), and 2) those which postulate that DnaJ 
first interacts with an unfolded (nascent) polypeptide, 
targeting it for binding to DnaK (74, 90, 221). Although 
there have been several reports that DnaJ (or HSP40) 
binds to some unfolded proteins (70, 95, 125, 126, 130, 203, 
221), unambiguous evidence that DnaJ or its homologs 
will bind to unfolded proteins in general is currently 
lacking (240). 

In the absence of the cochaperones, substrate pro- 
tein will cycle on and off the ATP complex and accumu- 
late only in the ADP complex. Although there are conflict- 
ing reports regarding the rate-limiting step in the intrinsic 
HSP70 ATPase activity, the evidence is strongly in favor 
of rate-limiting cleavage of the γ-phosphate of ATP, both
in the absence and presence of DnaJ and substrate pro- 
tein (115, 146, 147, 227). Both polypeptide substrates and 
DnaJ homologs stimulate the ATPase activity of HSP70 in 
E. coli, yeast, and human cytosol (147, 273). In the case of 
the yeast HSP70, Ssa1, the DnaJ homolog YDJ1 also accelerated
release of ATP from Ssa1 (273), suggesting a possible
explanation for the lack of a GrpE homolog in
eukaryotic cytosol. 

FIG. 4. Models of DnaK (HSP70) reaction cycle. Top: cycle starts
with substrate protein binding to DnaK-ATP complex. Bottom: cycle
starts with substrate protein binding to DnaJ. See text for details.

Thus the major pathway in the DnaK reaction cycle is
likely to be the following (Fig. 4A). 1) DnaK-ATP binds the
unfolded substrate protein; the resulting complex may
dissociate or bind DnaJ. 2) The latter complex will undergo
rapid DnaJ-stimulated hydrolysis of the ATP to
yield a "stable" DnaK-ADP-substrate protein complex
(due to the conformational change induced by the
ATP→ADP transition), which may or may not also contain
the DnaJ. 3) The ADP dissociates, catalyzed by GrpE,
and is replaced by ATP. 4) This induces a conformational
change to the low-affinity form, which results in dissociation
of the substrate protein, leaving a DnaK-ATP complex.
5) The latter can then either restart the cycle by
binding a substrate protein, or it can undergo ATP hydrolysis
to yield a DnaK-ADP complex. This would have to
dissociate the ADP and rebind an ATP before entering the
productive cycle again. The rates for several of the key
steps in the DnaK cycle have been reported (4, 84, 117,
163, 174, 224). Because of the complexity of the system,
the measured rates will be very sensitive to the concentrations
of all the species involved, as well as the temperature
and pH.
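
The major pathway enumerated above can be written down as a small transition table, which makes the productive loop and the substrate-free side branch explicit. This is only a bookkeeping sketch: the state and event names are illustrative labels introduced here, not standard nomenclature, and no rates are implied.

```python
# Transition table for the major DnaK pathway as summarized in the text.
# Keys are (current state, event); values are the resulting state.
# State and event names are illustrative labels, not standard nomenclature.
CYCLE = {
    ("DnaK-ATP", "bind_substrate"): "DnaK-ATP-substrate",
    ("DnaK-ATP-substrate", "release_substrate"): "DnaK-ATP",
    ("DnaK-ATP-substrate", "DnaJ_stimulated_hydrolysis"): "DnaK-ADP-substrate",
    ("DnaK-ADP-substrate", "GrpE_nucleotide_exchange"): "DnaK-ATP-substrate",
    ("DnaK-ATP", "hydrolysis"): "DnaK-ADP",
    ("DnaK-ADP", "nucleotide_exchange"): "DnaK-ATP",
}


def run_cycle(start, events):
    """Apply events in order; a KeyError means the table does not allow the step."""
    state = start
    path = [state]
    for event in events:
        state = CYCLE[(state, event)]
        path.append(state)
    return path


# Productive loop: steps 1 through 4 of the text, returning to DnaK-ATP.
productive = [
    "bind_substrate",
    "DnaJ_stimulated_hydrolysis",
    "GrpE_nucleotide_exchange",
    "release_substrate",
]
productive_path = run_cycle("DnaK-ATP", productive)
print(" -> ".join(productive_path))

# Side branch of step 5: hydrolysis without substrate, then nucleotide exchange.
side_path = run_cycle("DnaK-ATP", ["hydrolysis", "nucleotide_exchange"])
print(" -> ".join(side_path))
```

Both paths begin and end at DnaK-ATP, reflecting the statement that DnaK-ADP formed without substrate must exchange its nucleotide before it can reenter the productive cycle.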

The alternative class of models in which DnaJ (or 
HSP40) acts as the initial chaperone will involve 1) the
unfolded substrate protein binding to DnaJ, which will 
then 2) interact with DnaK-ATP, to form a transient 
HSP70-HSP40-U-ATP complex. This rapidly 3) undergoes 
hydrolysis of its ATP, resulting in the formation of a stable 
HSP70-HSP40-U-ADP complex. It is likely that HSP40 dis- 
sociates rapidly from such a complex. Displacement of 
ADP by ATP (catalyzed by GrpE in bacteria and mito- 
chondria) 4) triggers the release of substrate protein, thus 
completing the reaction cycle (Fig. 4B). 

Some of the newly synthesized proteins released by 
the HSP70 will fold spontaneously to the native state at a 
sufficiently fast rate that they neither aggregate nor bind
to another chaperone molecule (either HSP70 or chapero- 
nin) before they are fully folded. However, for some pro- 
teins, further interaction with a chaperonin, such as 
GroEL, is apparently required for complete folding (90, 
96). 



B. GroEL Reaction Cycle 

The GroEL cycle is by far the most studied and best 
understood chaperonin reaction cycle, yet there are still 
outstanding questions. A comprehensive review has been 
published recently (53). For GroEL, the folding reaction is 
driven by cycles of binding and release of the cochaper- 
one GroES, which alternate with binding and release of 
the nonnative protein substrate (62, 234). These cycles are 
driven by ATP binding and hydrolysis that control the 
conformation of the chaperonin and its affinity for nucle- 
otides and the cochaperonin GroES. There are three ma- 
jor functional states: one in which the unfolded substrate 
is bound tightly, another in which the substrate protein is 
trapped in the cavity capped by GroES but in which 
folding can proceed because the substrate protein is not 
bound to the walls of the cavity, and a final state in which 
the substrate protein is "ejected" regardless of whether it 
is folded or not. Partially folded protein will rebind to the 
chaperonin, continuing the cycle until folding is complete 
(145). A distinction has been made between released 
nonnative conformations that are committed to folding 
and those that are not. It is assumed that the isolation of 
a partially folded intermediate in the GroES-capped, rel- 
atively polar cavity will lead to significant folding occur- 
ring, without competition from aggregation. Mutant chap- 
eronins that are able to trap (bind but not release) 
substrate protein have proven very useful in such inves- 
tigations (21, 52, 256). 

Although both symmetric and asymmetric complexes 
of GroEL with GroES have been observed (211, 234, 237), 
only the latter are believed to be physiologically func- 
tional (194, 233) (although the existence of transient sym- 
metric complexes cannot be ruled out). In the asymmetric 
complexes, the GroEL ring with GroES attached is known 
as the cis-ring; the opposing (distal) ring is the trans-ring.
The recently determined structure of the GroEL-GroES-(ADP)7
complex revealed that the large rigid-block movements
of the intermediate and apical domains in the cis-ring
allowed bound GroES to stabilize a folding chamber
with ADP confined to the cis-ring (26, 188, 263). The 
conformational changes in the apical domains doubled 
the volume of the central cavity and resulted in burial of 
the hydrophobic peptide-binding residues at the interface 
with GroES and between the GroEL subunits. These 
structural changes result in the enlarged central cavity 
having a polar surface that favors protein folding (26, 
263). The conformational changes induced in GroEL upon
binding of ATP have been observed by several techniques
including electron microscopy imaging, X-ray crystallography,
and fluorescence labeling (26, 140, 237, 263). In
addition, it has been shown that binding of nucleotide to
one GroEL ring is strongly favored by GroES binding to
the other ring (211). The nucleotides bind to a site near
the top of the equatorial domain facing the cavity (9).
Under physiological concentrations of chaperonins
(equimolar GroEL and GroES) and nucleotides, the predominant
species is the asymmetric GroEL-GroES complex
(20).

FIG. 5. GroEL reaction cycle. T and D represent ATP and ADP, respectively. Ic represents an intermediate committed
to folding to native state, whereas Iuc represents intermediates that are not so committed. See text for details. [From
Weissman et al. (251).]

A recent model for the GroEL reaction cycle is shown 
in Figure 5. It is generally assumed that the asymmetric 
GroEL-GroES complex is the state that binds substrate 
protein (211). This species has ADP bound in the ring that 
is capped by GroES. The substrate protein thus binds to 
the hydrophobic cavity of the distal ring (20, 53, 145, 251). 
This trans-complex then converts to the cis-asymmetric
complex in which GroES caps the cavity containing the
substrate protein. This intermediate may arise either by a
transient symmetric complex with two GroES lids, or by
a complex in which the trans-GroES dissociates first.
Binding of ATP in the cis-ring leads to the major conformational
changes, especially in the apical domain leading
to an increase in the size of the cavity and conversion of 
its surface to a more polar environment. This leads to 
dissociation of the substrate protein from its hydrophobic 
interactions with the lining of the cavity, favoring the 
folding reaction. Simultaneously, hydrolysis of the ATP in 
the trans-ring leads to the release of GroES and the 
opportunity for the substrate protein to exit the cavity. If 
the released substrate protein has not reached the native 
state, it may rebind for another cycle (53, 55). 

Interactions between the two back-to-back rings in 



GroEL result in the allosteric regulation of ATP hydroly- 
sis, binding, and release of folding substrates and the 
cochaperonin GroES. Allosterism in ATP hydrolysis can 
be described by a model in which each ring of GroEL is in 
equilibrium between a low-affinity (T) and high-affinity 
(R) state for ATP, and in which the GroEL double ring is 
in equilibrium between three states: TT, TR, and RR. 
Electron microscopy (26, 188) images of all three alloste- 
ric states, TT, TR, and RR, have been obtained for various 
complexes (26). Unfolded substrate proteins bind prefer- 
entially to the T state and stimulate the ATPase activity of 
GroEL by both a direct effect on GroEL and a shift in the 
equilibrium from the RR state toward the more active TR 
state (266). GroES promotes the T to R transition of the 
ring distal to GroES in the GroEL-GroES complex. Owing 
to the relatively low affinity of the R conformation for 
nonfolded proteins, this transition leads to release of 
protein substrates from trans-ternary complexes of
GroEL, GroES, and protein substrate. The role of this 
release mechanism may be to assist the folding of rela- 
tively large proteins that cannot form cis-ternary com- 
plexes and/or to facilitate degradation of damaged pro- 
teins that cannot fold (111, 266). GroEL undergoes a 
conformational change that is partly maintained after 
ATP hydrolysis, as long as ADP and Pi are bound to the
GroEL ring (140). 

There have been several investigations of the rates 
for individual steps in the GroEL reaction cycle that dem- 
onstrate the importance of nucleotide binding and hydro- 
lysis, and GroES binding, on the rate of substrate protein 
release (79, 91, 138, 139, 157, 182, 212, 234). Horwich and 
co-workers (21) have shown that under normal condi- 
tions the rate of hydrolysis of ATP in the ring trans to the 
bound GroES determines the rate of release of the GroES 
and hence the rate of dissociation of the substrate protein.
This has been estimated to have a half-life of ~15 s
(21). Confirmation that this "timer" sets the length of time 
for which the folding substrate protein remains in the 
GroEL cavity comes from observations on the folding of 
mitochondrial malate dehydrogenase; trapping experiments
show that its dwell time on the complex is only 20 s
(182). This is in good agreement with both the rate of ATP
turnover and the dwell time of GroES on the complex but 
is much shorter than the time taken for the substrate to 
commit to the folded state. 
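
As a rough illustration of the "timer" described above, the following sketch treats GroES release as a single first-order process with the approximately 15-s half-life quoted in the text and computes the fraction of complexes that remain capped after a given time. The first-order assumption and the function name `fraction_still_capped` are illustrative only and are not taken from the cited studies.

```python
import math

HALF_LIFE_S = 15.0  # approximate half-life of GroES release quoted in the text


def fraction_still_capped(t_seconds, half_life_s=HALF_LIFE_S):
    """Fraction of cis complexes that have not yet released GroES at time t.

    Assumes simple first-order (single-exponential) release, which is an
    illustrative simplification rather than a result from the cited work.
    """
    rate_constant = math.log(2) / half_life_s  # k = ln 2 / t_half
    return math.exp(-rate_constant * t_seconds)


if __name__ == "__main__":
    for t in (15, 20, 30):
        print(f"t = {t} s: fraction still capped = {fraction_still_capped(t):.2f}")
```

Under this assumption, a dwell time close to the half-life means that a substantial fraction of substrate molecules is released before folding is complete, which is consistent with the need for repeated rounds of binding.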

Evidence is accumulating to indicate that GroEL is 
able to unfold misfolded conformations (1, 158, 183, 213, 
232, 234, 268). The protection factors for the backbone 
amide protons of cyclophilin A bound to GroEL have been 
calculated from measurements of the rates of hydrogen/ 
deuterium exchange using NMR (158); in contrast to the 
native structure, similar protection factors were found 
throughout the sequence, consistent with complete unfolding
of the substrate protein. Clarke and co-workers (183)
studied the GroEL-facilitated folding of mitochondrial
malate dehydrogenase and showed that the chaperonin
accelerated the dissociation of a misfolded intermediate
formed by reversible aggregation of an early
partially folded intermediate, through a repeated binding 
and release cycle coupled to ATP hydrolysis. It is likely 
that the apparent "unfoldase" activity of GroEL actually 
arises from its preferential affinity for the unfolded con- 
formation (247). Thus, through mass action, misfolded 
intermediates will be unfolded and given a new chance to 
fold productively in the GroEL cavity. 

The key factors in the chaperonin cycle therefore are 
as follows: 1) nonnative substrate protein binds to the
trans-ring of GroEL, in which ADP and GroES are bound 
in the "opposite" GroEL ring. Binding is facilitated by the 
hydrophobic surfaces of the apical domain lining the cav- 
ity in the GroEL ring. 2) Subsequent ATP binding to the 
cis-ring leads to release of the ADP and GroES, followed 
by 3) binding of ATP and GroES to the cis-ring, which results in
the massive conformational change leading to the en- 
larged cavity. This conformational change triggers the 
release of the substrate protein from the surface of the 
apical domain and also "starts the clock." 4) Hydrolysis of 
the ATP in the cis-ring weakens the interaction between 
GroES and the cis-ring, and binding of ATP in the trans- 
ring leads to the complete release of the GroES and 
substrate protein with a half-life of ~15 s (194).

VI. CONCLUDING REMARKS 

Molecular chaperones recognize and bind to nascent 
polypeptide chains and partially folded intermediates of 
proteins, preventing their aggregation and misfolding. The 
folding of most newly synthesized proteins in the cell will 



involve interaction with one or more chaperones. The 
chaperones most generally implicated in protein folding 
are the HSP40 (DnaJ), HSP60 (GroEL), and HSP70 (DnaK) 
families. Recent investigations using a wide variety of 
techniques ranging from genetics to biophysics have be- 
gun to unravel the complexities of these chaperone ma- 
chines. At the heart of the general protein folding machin- 
ery of the cell are the reaction cycles of HSP60, HSP70, 
and their cochaperones. For both these chaperone sys- 
tems, the binding of ATP triggers a critical conformational 
change ultimately leading to release of the bound sub- 
strate protein. Although both chaperone systems mini- 
mize aggregation of newly synthesized proteins, the 
HSP60 chaperones also facilitate the actual folding pro- 
cess by providing a secluded environment for individual 
folding molecules and may also promote the unfolding 
and refolding of misfolded intermediates. Different cellu- 
lar locations, with their different roles in the production 
of new proteins, have specific chaperone systems tailored 
to the demands of the specific location (e.g., ER, mito- 
chondria). Because of the critical nature of chaperones in 
maintaining orderly functioning of the cell, substantial 
redundancy is found in that multiple versions of chaper- 
ones are usually present. For selected proteins, additional 
specific chaperones are required for their folding and 
assembly. Although we now have what appears to be a 
good picture of the general outline of in vivo chaperone- 
mediated protein folding, it is clear that there are still a 
very large number of unanswered questions, especially 
regarding the molecular details. 
ABSTRACT The recent ability to sequence whole genomes
allows ready access to all genetic material. The approaches 
outlined here allow automated analysis of sequence for the 
synthesis of optimal primers in an automated multiplex 
oligonucleotide synthesizer (AMOS). The efficiency is such 
that all ORFs for an organism can be amplified by PCR. The 
resulting amplicons can be used directly in the construction of 
DNA arrays or can be cloned for a large variety of functional 
analyses. These tools allow a replacement of single-gene 
analysis with a highly efficient whole-genome analysis. 

The genome sequencing projects have generated and will 
continue to generate enormous amounts of sequence data. The 
genomes of Saccharomyces cerevisiae, Escherichia coli, Hae- 
mophilus influenzae (1), Mycoplasma genitalium (2), and Meth- 
anococcus jannaschii (3) have been completely sequenced. 
Other model organisms have had substantial portions of their 
genomes sequenced as well, including the nematode Caeno- 
rhabditis elegans (4) and the small flowering plant Arabidopsis 
thaliana (5). This massive and increasing amount of sequence 
information allows the development of novel experimental 
approaches to identify gene function. 

One standard use of genome sequence data is to attempt to 
identify the functions of predicted open reading frames 
(ORFs) within the genome by comparison to genes of known 
function. Such a comparative analysis of all ORFs to existing 
sequence data is fast, simple, and requires no experimentation 
and is therefore a reasonable first step. While finding sequence 
homologies/motifs is not a substitute for experimentation, 
noting the presence of sequence homology and/or sequence 
motifs can be a useful first step in finding interesting genes, in 
designing experiments and, in some cases, predicting function. 
However, this type of analysis is frequently uninformative. For
example, over one-half of new ORFs in S. cerevisiae have no 
known function (6). If this is the case in a well studied organism 
such as yeast, the problem will be even worse in organisms that 
are less well studied or less manipulable. A large, experimentally
determined gene function database would make homology/motif
searches much more useful.

Experimental analysis must be performed to thoroughly
understand the biological function of a gene product. Scaling 
up from classical "cottage industry" one-gene-oriented ap- 
proaches to whole-genome analysis would be very expensive 
and laborious. It is clear that novel strategies are necessary to 
efficiently pursue the next phase of the genome projects:
whole-genome experimental analysis to explore gene expres- 
sion, gene product function, and other genome functions. 
Model organisms, such as S. cerevisiae, will be extremely 



important in the development of novel whole-genome analysis 
techniques and, subsequently, in improving our understanding 
of other more complex and less manipulable organisms.

The genome sequence can be systematically used as a tool 
to understand ORFs, gene product function, and other ge- 
nome regions. Toward this end, a directed strategy has been 
developed for exploiting sequence information as a means of 
providing information about biological function (Fig. 1). Ef- 
forts have been directed toward the amplification of each 
predicted ORF or any other region of the genome ranging 
from a few base pairs to several kilobase pairs. There are many 
uses for these amplicons: they can be cloned into standard
vectors or specialized expression vectors, or can be cloned into 
other specialized vectors such as those used for two-hybrid 
analysis. The amplicons can also be used directly by, for 
example, arraying onto glass for expression analysis, for DNA 
binding assays, or for any direct DNA assay (7). As a pilot 
study, synthetic primers were made on the 96-well automated 
multiplex oligonucleotide synthesizer (AMOS) instrument (8) 
(Fig. 2). These oligonucleotides were used to amplify each 
ORF on yeast chromosome V. The current version of this 
instrument can synthesize three plates of 96 oligonucleotides 
each (25 bases) in an 8-hr day. The amplification of the entire 
set of PCR products was then analyzed by gel electrophoresis 
(Fig. 3). Successful amplification of the proper length product 
on the first attempt was 95%. This project demonstrates that 
one can go directly from sequence information to biological 
analysis in a truly automated, totally directed manner. 
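
The throughput figures quoted above lend themselves to a quick estimate of how long primer synthesis would take for a whole genome. The sketch below assumes a single instrument running at the stated rate of three 96-well plates per 8-hr day, two primers per ORF, and the roughly 6,000 yeast ORFs mentioned in the Fig. 2 legend; the function name `synthesis_days` is hypothetical.

```python
import math

PLATES_PER_DAY = 3      # plates per 8-hr day, as stated in the text
WELLS_PER_PLATE = 96    # oligonucleotides per plate
PRIMERS_PER_ORF = 2     # one forward and one reverse primer


def synthesis_days(n_orfs):
    """Working days needed to synthesize both primers for n_orfs ORFs.

    Assumes one instrument and no failed syntheses (both are simplifications).
    """
    oligos_needed = n_orfs * PRIMERS_PER_ORF
    oligos_per_day = PLATES_PER_DAY * WELLS_PER_PLATE
    return math.ceil(oligos_needed / oligos_per_day)


if __name__ == "__main__":
    print(synthesis_days(6000))  # about 6,000 ORFs in S. cerevisiae
```

With these assumptions the full yeast primer set corresponds to about 42 instrument days.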

These amplicons can be incorporated directly in arrays or 
the amplicons can be cloned. If the amplicons are to be cloned, 
novel sequences can be incorporated at the 5' end of the 
oligonucleotide to facilitate cloning. One potential problem 
with cloning PCR products is that the cloned amplicons may 
contain sequence alterations that diminish their utility. One 
option would be to resequence each individual amplicon. 
However, this is expensive, inefficient, and time consuming. A 
faster, more cost-effective, and more accurate approach is to 
apply comparative sequencing by denaturing HPLC (9). This 
method is capable of detecting a single base change in a 2-kb 
heteroduplex. Longer amplicons can be analyzed by use of 
appropriate restriction fragments. If any change is detected in 
a clone, an alternate clone of the same region can be analyzed. 
Modifying the system to allow high throughput analysis by 
denaturing HPLC is also relatively simple and straightforward. 

If amplicons are used directly on arrays without cloning, it 
is important to note that, even if single PCR product bands are 
observed on gels, the PCR products will be contaminated with 
various amounts of other sequences. This contamination has 
the potential to affect the results in, for example, expression 
Fig. 1. Overview of systematic method for isolating individual
genes. Sequence information is obtained automatically from sequence 
databases. The data are input into primer selection software specifi- 
cally designed to target ORFs as designated by database annotations. 
The output file containing the primer information is directly read by 
a high-throughput oligonucleotide synthesizer, which makes the oli- 
gonucleotides in 96-well plates (AMOS, automated multiplex oligo- 
nucleotide synthesizer). The forward and reverse primers are synthe- 
sized in the same location on separate plates to facilitate the down- 
stream handling of primers. The amplicons are generated by PCR in 
96-well plates as well. 

analysis. On the other hand, direct use of the amplicons is 
much less labor intensive and greatly decreases the occurrence 
of mistakes in clone identification, a ubiquitous problem 
associated with large clone set archiving and retrieving. 

Any large-scale effort to capture each ORF within a genome 
must rely on automation if cost is to be minimized while 
efficiency is maximized. Toward that end, primers targeting 
ORFs were designed automatically using simple new scripts 
and existing primer selection software. These script-selected 
primer sequences were directly read by the high-throughput 
synthesizer and the forward and reverse primers were synthe- 
sized in separate plates in corresponding wells to facilitate 
automated pipetting and PCR amplifications. Each of the 
resulting PCR products, generated with minimum labor, con- 
tains a known, unique ORF. 
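
The plate layout described above can be sketched in a few lines. The example below is a deliberately naive stand-in for the primer selection software, which is not specified in the text: it takes the first 25 bases of each ORF as the forward primer and the reverse complement of the last 25 bases as the reverse primer, and it places each pair in the same well of two separate 96-well plates. The function names, the ORF name, and the terminal-primer rule are all assumptions made for illustration; real primer selection would also weigh melting temperature and specificity.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
PRIMER_LENGTH = 25          # oligonucleotide length quoted in the text
ROWS = "ABCDEFGH"           # 8 rows x 12 columns = 96 wells
COLUMNS = range(1, 13)


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_pair(orf_seq, length=PRIMER_LENGTH):
    """Naive primer pair: ORF start, and reverse complement of ORF end."""
    orf_seq = orf_seq.upper()
    if len(orf_seq) < length:
        raise ValueError("ORF is shorter than the primer length")
    return orf_seq[:length], reverse_complement(orf_seq[-length:])


def well_names():
    """Well identifiers in row-major order: A1 ... A12, B1 ... H12."""
    return [f"{row}{col}" for row in ROWS for col in COLUMNS]


def plate_layout(orfs):
    """Assign up to 96 ORFs to matching wells on forward and reverse plates."""
    wells = well_names()
    if len(orfs) > len(wells):
        raise ValueError("more ORFs than wells on one plate")
    forward_plate, reverse_plate = {}, {}
    for well, (name, seq) in zip(wells, orfs.items()):
        forward, reverse = primer_pair(seq)
        forward_plate[well] = (name, forward)
        reverse_plate[well] = (name, reverse)
    return forward_plate, reverse_plate
```

Keeping the forward and reverse primers for one ORF in the same well position means that a pipetting robot can combine plate pairs well by well without any lookup table.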

Large-scale genome analysis projects are dependent on 
newly emerging technologies to make the studies practical and 
economically feasible. For example, the cost of the primers, a 
significant issue in the past, has been reduced dramatically to 
make feasible this and other projects that require tens of 
thousands of oligonucleotides. Other methods of high- 
throughput analysis are also vital to the success of functional 
analysis projects, such as microarraying and oligonucleotide 
chip methods (10-14). 

Changes in attitude are also required. One of the major costs 
of commercial oligonucleotides is extensive quality control 
such that virtually 100% of the supplied oligonucleotides are 
successfully synthesized and work for their intended purpose. 
Fig. 2. Overall approach for using database of a genome to direct
biological analysis. The synthesis of the 6,000 ORFs (orfs) for each 
gene of S. cerevisiae can be used in many applications utilizing both 
cloning and microarraying technology. 

Considerable cost reduction can be obtained by simply de- 
creasing the expected successful synthesis rate to 95-97%. One 
can then achieve faster and cheaper whole genome coverage by 
simply adding a single quality control at the end of the 
experiment and batching the failures for resynthesis. 
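
The trade-off described above can be made concrete with a small calculation. The sketch below estimates how many oligonucleotides would fail at a given success rate and how many synthesis rounds are needed before fewer than one failure is expected to remain. It assumes that failures are independent and that resynthesis succeeds at the same rate as the first pass; the figure of 12,000 oligonucleotides (two primers for each of about 6,000 ORFs) and the function names are illustrative assumptions.

```python
def expected_failures(n_oligos, success_rate):
    """Expected number of failed syntheses in one pass."""
    return n_oligos * (1.0 - success_rate)


def rounds_until_complete(n_oligos, success_rate):
    """Synthesis rounds until fewer than one failed oligo is expected to remain.

    Assumes each resynthesis round has the same success rate as the first.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    remaining = float(n_oligos)
    rounds = 0
    while remaining >= 1.0:
        remaining *= 1.0 - success_rate  # only the failures are carried forward
        rounds += 1
    return rounds


if __name__ == "__main__":
    for rate in (0.95, 0.97):
        failures = expected_failures(12000, rate)
        rounds = rounds_until_complete(12000, rate)
        print(f"success rate {rate:.0%}: about {failures:.0f} failures, {rounds} rounds")
```

Because each round only has to cover the failures of the previous one, the extra rounds are small batches rather than repeats of the whole set.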

The directed nature of the amplicon approach is of clear 
advantage. The sequence of each ORF is analyzed automati- 
cally, and unique specific primers are made to target each 
ORF. Thus, there is relatively little time or labor involved; for
example, no random cloning and subsequent screening is 
required because each product is known. In the test system, 
primers for 240 ORFs from chromosome V were systematically 
synthesized, beginning from the left arm and continuing 
through to the right arm. At no point was there any manual 
analysis of sequence information to generate the collection. In 
many ways, now that the sequence is known, there is no need 
for the researcher to examine it. 

These amplicons can be arrayed and expression analysis can 
be done on all arrayed ORFs with a single hybridization (10).
Those ORFs that display significant differential expression 
patterns under a given selection are easily identified without 
the laborious task of searching for and then sequencing a clone. 
Once scaled up, the procedure provides even greater returns 
on effort, because a single hybridization will ultimately provide 
a "snapshot" of the expression of all genes in the yeast genome. 
Thus, the limiting factor in whole genome analysis will not be 
the analysis process itself, but will instead be the ability of 
researchers to design and carry out experimental selections. 

Current expression and genetic analysis technologies are 
geared toward the analysis of single genes and are ill suited to 
analyze numerous genes under many conditions. Additional 
difficulties with current technologies include: the effort and 
expense required to analyze expression and make mutants, the 
potential duplication of effort if done by different laboratories, 
and the possibility of conflicting results obtained from differ- 
ent laboratories. In contrast, whole genome analysis not only 
is more efficient, it also provides data of much higher quality; 
all genes are assayed and compared in parallel under exactly 
the same conditions. In addition, amplicons have many applications
beyond gene expression. For example, one recent
approach is to incorporate a unique DNA sequence tag, 
synthesized as part of each gene specific primer, during 
amplification. The tags or molecular bar codes, when reintro- 
duced into the organism as a gene deletion or as a gene clone, 
can be used much more efficiently than individual mutations 
or clones because pools of tagged mutants or transformants 
can be analyzed in parallel. This parallel analysis is possible 
because the tags are readily and quantitatively amplified even 
in complex mixtures of tags (13). 

These ORF genome arrays and oligonucleotide tagged 
libraries can be used for many applications. Any conventional 
selection applied to a library that gives discrete or multiple 
products can use these technologies for a simple direct read- 
out. These include screens and selections for mutant comple- 
mentation, overexpression suppression (15, 16), second-site 
suppressors, synthetic lethality, drug target overexpression 
(17), two-hybrid screens (18), genome mismatch scanning (19),
or recombination mapping. 

The genome projects have provided researchers with a vast 
amount of information. These data must be used efficiently 
and systematically to gain a truly comprehensive understand- 
ing of gene function and, more broadly, of the entire genome 
which can then be applied to other organisms. Such global 
approaches are essential if we are to gain an understanding of 
the living cell. This understanding should come from the 
viewpoint of the integration of complex regulatory networks, 
the individual roles and interactions of thousands of functional 
gene products, and the effect of environmental changes on 
both gene regulatory networks and the roles of all gene 
products. The time has come to switch from the analysis of a 
single gene to the analysis of the whole genome. 

Support was provided by National Institutes of Health Grants 
R37H60198 and P01H600205. 



1. Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A.,
Kirkness, E. F., et al. (1995) Science 269, 496-512.

2. Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton,
R. A., et al. (1995) Science 270, 397-403.

3. Bult, C. J., White, O., Olsen, G. J., Zhou, L., Fleischmann, R. D.,
et al. (1996) Science 273, 1058-1073.

4. Sulston, J., Du, Z., Thomas, K., Wilson, R., Hillier, L., Staden, R.,
Halloran, N., Green, P., Thierry-Mieg, J., Qiu, L., Dear, S.,
Coulson, A., Craxton, M., Durbin, R., Berks, M., Metzstein, M.,
Hawkins, T., Ainscough, R. & Waterston, R. (1992) Nature
(London) 356, 37-41.

5. Newman, T., de Bruijn, F. J., Green, P., Keegstra, K., Kende, H.,
et al. (1994) Plant Physiol. 106, 1241-1255.

6. Oliver, S. (1996) Nature (London) 379, 597-600.

7. Lashkari, D. A. (1996) Ph.D. dissertation (Stanford Univ.,
Stanford, CA).

8. Lashkari, D. A., Hunicke-Smith, S. P., Norgren, R. M., Davis,
R. W. & Brennan, T. (1995) Proc. Natl. Acad. Sci. USA 92,
7912-7915.

9. Oefner, P. J. & Underhill, P. A. (1995) Am. J. Hum. Genet. 57,
A266.

10. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. (1995)
Science 270, 467-470.

11. Fodor, S. P., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T. &
Solas, D. (1991) Science 251, 767-773.

12. Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X. C., Stern,
D., Winkler, J., Lockhart, D. J., Morris, M. S. & Fodor, S. P.
(1996) Science 274, 610-614.

13. Shoemaker, D. D., Lashkari, D. A., Morris, D., Mittmann, M. &
Davis, R. W. (1996) Nat. Genet. 14, 450-456.

14. Smith, V., Chou, K., Lashkari, D., Botstein, D. & Brown, P. O.
(1996) Science 274, 2069-2074.

15. Magdolen, V., Drubin, D. G., Mages, G. & Bandlow, W. (1993)
FEBS Lett. 316, 41-47.

16. Ramer, S. W., Elledge, S. J. & Davis, R. W. (1992) Proc. Natl.
Acad. Sci. USA 89, 11589-11593.

17. Rine, J., Hansen, W., Hardeman, E. & Davis, R. W. (1983) Proc.
Natl. Acad. Sci. USA 80, 6750-6754.

18. Fields, S. & Song, O. (1989) Nature (London) 340, 245-246.

19. Nelson, S. F., McCusker, J. H., Sander, M. A., Kee, Y., Modrich,
P. & Brown, P. O. (1994) Nat. Genet. 4, 11-18.


Reference 2 of 9 

with Response dated 4/23/04 

In USSN: 10/049,742



MOLECULAR CARCINOGENESIS 24:153-159 (1999) 



IN PERSPECTIVE

Claudio J. Conti, Editor 

Microarrays and Toxicology: The Advent of 
Toxicogenomics 

Emile F. Nuwaysir,1 Michael Bittner,2 Jeffrey Trent,2 J. Carl Barrett,1 and Cynthia A. Afshari1

1 Laboratory of Molecular Carcinogenesis, National Institute of Environmental Health Sciences, Research Triangle Park,
North Carolina

2 Laboratory of Cancer Genetics, National Human Genome Research Institute, Bethesda, Maryland

The availability of genome-scale DNA sequence information and reagents has radically altered life-science 
research. This revolution has led to the development of a new scientific subdiscipline derived from a combina- 
tion of the fields of toxicology and genomics. This subdiscipline, termed toxicogenomics, is concerned with the 
identification of potential human and environmental toxicants, and their putative mechanisms of action, through 
the use of genomics resources. One such resource is DNA microarrays or "chips," which allow the monitoring of 
the expression levels of thousands of genes simultaneously. Here we propose a general method by which gene 
expression, as measured by cDNA microarrays, can be used as a highly sensitive and informative marker for 
toxicity. Our purpose is to acquaint the reader with the development and current state of microarray technol- 
ogy and to present our view of the usefulness of microarrays to the field of toxicology. Mol. Carcinog. 24:153- 
INTRODUCTION

Technological advancements combined with in- 
tensive DNA sequencing efforts have generated an 
enormous database of sequence information over the 
past decade. To date, more than 3 million sequences, 
totaling over 2.2 billion bases [1], are contained 
within the GenBank database, which includes the 
complete sequences of 19 different organisms [2]. The
first complete sequence of a free-living organism, 
Haemophilus influenzae, was reported in 1995 [3] and 
was followed shortly thereafter by the first complete 
sequence of a eukaryote, Saccharomyces cerevisiae [4].
The development of dramatically improved sequenc- 
ing methodologies promises that complete elucida- 
tion of the Homo sapiens DNA sequence is not far 
behind [5]. 

To exploit more fully the wealth of new sequence 
information, it was necessary to develop novel meth- 
ods for the high-throughput or parallel monitoring 
of gene expression. Established methods such as 
northern blotting, RNAse protection assays, S1 nuclease
analysis, plaque hybridization, and slot blots
do not provide sufficient throughput to effectively 
utilize the new genomics resources. Newer methods 
such as differential display [6], high-density filter 
hybridization [7,8], serial analysis of gene expression 
[9], and cDNA- and oligonucleotide-based microarray 
"chip" hybridization [10-12] are possible solutions 
to this bottleneck. It is our belief that the microarray 
approach, which allows the monitoring of expres- 
sion levels of thousands of genes simultaneously, is 
a tool of unprecedented power for use in toxicology 
studies. 



Almost without exception, gene expression is al- 
tered during toxicity, as either a direct or indirect 
result of toxicant exposure. The challenge facing 
toxicologists is to define, under a given set of ex- 
perimental conditions, the characteristic and spe- 
cific pattern of gene expression elicited by a given 
toxicant. Microarray technology offers an ideal plat- 
form for this type of analysis and could be the foun- 
dation for a fundamentally new approach to 
toxicology testing. 

MICROARRAY DEVELOPMENT AND APPLICATIONS 
cDNA Microarrays 

In the past several years, numerous systems were 
developed for the construction of large-scale DNA 
arrays. All of these platforms are based on cDNAs 
or oligonucleotides immobilized to a solid sup- 
port. In the cDNA approach, cDNA (or genomic) 
clones of interest are arrayed in a multi-well for- 
mat and amplified by polymerase chain reaction. 
The products of this amplification, which are usu- 
ally 500- to 2000-bp clones from the 3' regions of 
the genes of interest, are then spotted onto solid 
support by using high-speed robotics. By using 
this method, microarrays of up to 10 000 clones 
can be generated by spotting onto a glass substrate 
[13,14]. Sample detection for microarrays on glass
involves the use of probes labeled with fluores- 
cent or radioactive nucleotides. 

Fluorescent cDNA probes are generated from con- 
trol and test RNA samples in single-round reverse-tran- 
scription reactions in the presence of fluorescently 
tagged dUTP (e.g., Cy3-dUTP and Cy5-dUTP), which 
produces control and test products labeled with dif- 
ferent fluors. The cDNAs generated from these two 
populations, collectively termed the "probe," are then 
mixed and hybridized to the array under a glass cov- 
erslip [10,11,15]. The fluorescent signal is detected 
by using a custom-designed scanning confocal mi- 
croscope equipped with a motorized stage and lasers 
for fluor excitation [10,11,15]. The data are analyzed
with custom digital image analysis software that de- 
termines for each DNA feature the ratio of fluor 1 to 
fluor 2, corrected for local background [16,17]. The 
strength of this approach lies in the ability to label 
RNAs from control and treated samples with differ- 
ent fluorescent nucleotides, allowing for the simul- 
taneous hybridization and detection of both 
populations on one microarray. This method elimi- 
nates the need to control for hybridization between 
arrays. The research groups of Drs. Patrick Brown and 
Ron Davis at Stanford University spearheaded the 
effort to develop this approach, which has been suc- 
cessfully applied to studies of Arabidopsis thaliana 
RNA [10], yeast genomic DNA [15], tumorigenic ver- 
sus non-tumorigenic human tumor cell lines [11], 
human T-cells [18], yeast RNA [19], and human in- 
flammatory disease-related genes [20]. The most dra- 
matic result of this effort was the first published 
account of gene expression of an entire genome, that 
of the yeast Saccharomyces cervisiae [21]. 

In an alternative approach, large numbers of cDNA 
clones can be spotted onto a membrane support, al- 
beit at a lower density [7,22]. This method is useful 
for expression profiling and large-scale screening and 
mapping of genomic or cDNA clones [7,22-24]. In 
expression profiling on filter membranes, two dif- 
ferent membranes are used simultaneously for con- 
trol and test RNA hybridizations, or a single 
membrane is stripped and reprobed. The signal is 
detected by using radioactive nucleotides and visu- 
alized by phosphorimager analysis or autoradiogra- 
phy. Numerous companies now sell such cDNA 
membranes and software to analyze the image data 
[25-27]. 

Oligonucleotide Microarrays 

Oligonucleotide microarrays are constructed either 
by spotting prefabricated oligos on a glass support 
[13] or by the more elegant method of direct in situ 
oligo synthesis on the glass surface by photolithog- 
raphy [28-30]. The strength of this approach lies in 
its ability to discriminate DNA molecules based on 
single base-pair difference. This allows the applica- 
tion of this method to the fields of medical diagnos- 



tics, pharmacogenetics, and sequencing by hybrid- 
ization as well as gene-expression analysis. 

Fabrication of oligonucleotide chips by photoli- 
thography is theoretically simple but technically 
complex [29,30]. The light from a high-intensity 
mercury lamp is directed through a photolitho- 
graphic mask onto the silica surface, resulting in 
deprotection of the terminal nucleotides in the illu- 
minated regions. The entire chip is then reacted with 
the desired free nucleotide, resulting in selected chain 
elongation. This process requires only 4n cycles 
(where n = oligonucleotide length in bases) to syn- 
thesize a vast number of unique oligos, the total num- 
ber of which is limited only by the complexity of the 
photolithographic mask and the chip size [29,31,32]. 
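
The 4n figure follows from needing at most one coupling cycle per base type at each of the n positions, whereas the number of distinct sequences of length n is 4 to the power n. The short sketch below compares the two quantities; the function names are illustrative, and the 25-base length is only an example value.

```python
BASES = 4  # A, C, G, T


def synthesis_cycles(oligo_length):
    """Maximum coupling cycles for in situ synthesis: one per base per position."""
    return BASES * oligo_length


def possible_sequences(oligo_length):
    """Number of distinct oligonucleotide sequences of the given length."""
    return BASES ** oligo_length


if __name__ == "__main__":
    n = 25  # example length
    print(f"{synthesis_cycles(n)} cycles, {possible_sequences(n)} possible sequences")
```

The number of cycles grows linearly with oligonucleotide length while the sequence space grows exponentially, which is why the practical limits are the mask complexity and chip size, not the chemistry.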

Sample preparation involves the generation of 
double-stranded cDNA from cellular poly(A)+ RNA 
followed by antisense RNA synthesis in an in vitro 
transcription reaction with biotinylated or fluor- 
tagged nucleotides. The RNA probe is then frag- 
mented to facilitate hybridization. If the indirect 
visualization method is used, the chips are incubated 
with fluor-linked streptavidin (e.g., phycoerythrin) 
after hybridization [12,33]. The signal is detected with 
a custom confocal scanner [34]. This method has 
been applied successfully to the mapping of genomic 
library clones [35], to de novo sequencing by hybrid- 
ization [28,36], and to evolutionary sequence com- 
parison of the BRCA1 gene [37]. In addition, 
mutations in the cystic fibrosis [38] and BRCA1 [39] 
gene products and polymorphisms in the human
immunodeficiency virus-1 clade B protease gene [40]
have been detected by this method. Oligonucleotide 
chips are also useful for expression monitoring [33] 
as has been demonstrated by the simultaneous evalu- 
ation of gene-expression patterns in nearly all open 
reading frames of the yeast S. cerevisiae [12].
More recently, oligonucleotide chips have been used 
to help identify single nucleotide polymorphisms in 
the human [41] and yeast [42] genomes. 

THE USE OF MICROARRAYS IN TOXICOLOGY 

Screening for Mechanism of Action 

The field of toxicology uses numerous in vivo 
model systems, including the rat, mouse, and rab- 
bit, to assess potential toxicity and these bioassays 
are the mainstay of toxicology testing. However, in 
the past several decades, a plethora of in vitro tech- 
niques have been developed to measure toxicity, 
many of which measure toxicant-induced DNA dam- 
age. Examples of these assays include the Ames test, 
the Syrian hamster embryo cell transformation as- 
say, micronucleus assays, measurements of sister 
chromatid exchange and unscheduled DNA synthe- 
sis, and many others. Fundamental to all of these 
methods is the fact that toxicity is often preceded 
by, and results in, alterations in gene expression. In 
many cases, these changes in gene expression are a 
far more sensitive, characteristic, and measurable
endpoint than the toxicity itself. We therefore pro- 
pose that a method based on measurements of the 
genome-wide gene expression pattern of an organ- 
ism after toxicant exposure is fundamentally infor- 
mative and complements the established methods 
described above. 

We are developing a method by which toxicants 
can be identified and their putative mechanisms of 
action determined by using toxicant-induced gene ex- 
pression profiles. In this method, in one or more de- 
fined model systems, dose and time-course parameters 
are established for a series of toxicants within a given 
prototypic class (e.g., polycyclic aromatic hydrocar- 
bons (PAHs)). Cells are then treated with these agents 
at a fixed toxicity level (as measured by cell survival), 
RNA is harvested, and toxicant-induced gene expres- 
sion changes are assessed by hybridization to a cDNA 
microarray chip (Figure 1). We have developed a custom
DNA chip, called ToxChip v1.0, specifically for
this purpose and will discuss it in more detail below. 
The changes in gene expression induced by the test 
agents in the model systems are analyzed, and the 
common set of changes unique to that class of toxi- 
cants, termed a toxicant signature, is determined. 

This signature is derived by ranking across all
experiments the gene-expression data based on relative
fold induction or suppression of genes in treated
samples versus untreated controls and selecting the
most consistently different signals across the sample 
set. A different signature may be established for each 
prototypic toxicant class. Once the signatures are de- 
termined, gene-expression profiles induced by un- 
known agents in these same model systems can then 
be compared with the established signatures. A match 
assigns a putative mechanism of action to the test 
compound. Figure 2 illustrates this signature method 
for different types of oxidant stressors, PAHs, and 
peroxisome proliferators. In this example, the un- 
known compound in question had a gene-expres- 
sion profile similar to that of the oxidant stressors in 
the database. We anticipate that this general method 
will also reveal cross talk between different pathways 
induced by a single agent (e.g., reveal that a com- 
pound has both PAH-like and oxidant-like proper- 
ties). In the future, it may be necessary to distinguish 
very subtle differences between compounds within 
a very large sample set (e.g., thousands of highly simi- 
lar structural isomers in a combinatorial chemistry 
library or peptide library). To generate these highly 
refined signatures, standard statistical clustering tech- 
niques or principal-component analysis can be used. 
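The signature-and-match idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a fixed two-fold threshold (log2 ratio of at least 1) in place of the ranking step described in the text, and the gene names, values, and function names are hypothetical.

```python
def derive_signature(experiments, min_abs_log2=1.0):
    """Keep genes that change in the same direction, by at least
    min_abs_log2, in every experiment of one toxicant class.
    Each experiment is a dict of gene -> log2(treated/control)."""
    signature = {}
    for gene in experiments[0]:
        values = [exp[gene] for exp in experiments]
        if all(v >= min_abs_log2 for v in values):
            signature[gene] = 1     # consistently induced
        elif all(v <= -min_abs_log2 for v in values):
            signature[gene] = -1    # consistently suppressed
    return signature


def match_fraction(profile, signature, min_abs_log2=1.0):
    """Fraction of signature genes whose change the unknown profile reproduces."""
    if not signature:
        return 0.0
    hits = sum(
        1
        for gene, direction in signature.items()
        if profile.get(gene, 0.0) * direction >= min_abs_log2
    )
    return hits / len(signature)


# Hypothetical log2 ratios from two known agents of one class.
oxidant_runs = [
    {"geneA": 2.1, "geneB": -1.6, "geneC": 0.2},
    {"geneA": 1.4, "geneB": -2.0, "geneC": -0.3},
]
signature = derive_signature(oxidant_runs)
unknown = {"geneA": 1.8, "geneB": -1.2, "geneC": 0.9}
print(signature, match_fraction(unknown, signature))
```

A high match fraction would suggest, but not prove, a shared mechanism; for large sets of closely related compounds the clustering or principal-component methods mentioned in the text would be more appropriate than a fixed threshold.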

For the studies outlined in Figure 2, we developed 
the custom cDNA microarray chip ToxChip v1.0.

Figure 1. Simplified overview of the method for sample
preparation and hybridization to cDNA microarrays. For
illustrative purposes, samples derived from cell culture
are depicted, although other sample types are amenable
to this analysis.



NUWAYSIR ET AL.



[Figure 2 graphic: gene-expression patterns for known
agents (oxidant stressors, polycyclic aromatic
hydrocarbons, peroxisome proliferators) are reduced to
toxicant signatures and compared with the pattern of a
suspected toxicant, giving "No Match", "No Match", and
"Match" outcomes.]

Figure 2. Schematic representation of the method for
identification of a toxicant's mechanism of action. In
this method, gene-expression data derived from exposure
of model systems to known toxicants are analyzed, and a
set of changes characteristic to that type of toxicant
(termed the toxicant signature) is identified. As
depicted, oxidant stressors produce consistent changes
in group A genes (indicated by red and green circles),
but not group B or C genes (indicated by gray circles).
The set of gene-expression changes elicited by the
suspected toxicant is then compared with these
characteristic patterns, and a putative mechanism of
action is assigned to the unknown agent.



The 2090 human genes that comprise this subarray 
were selected for their well-documented involve- 
ment in basic cellular processes as well as their re- 
sponses to different types of toxic insult. Included 
on this list are DNA replication and repair genes, 
apoptosis genes, and genes responsive to PAHs and 
dioxin-like compounds, peroxisome proliferators, 
estrogenic compounds, and oxidant stress. Some of 
the other categories of genes include transcription 
factors, oncogenes, tumor suppressor genes, cyclins, 
kinases, phosphatases, cell adhesion and motility 
genes, and homeobox genes. Also included in this 
group are 84 housekeeping genes, whose hybridiza- 
tion intensity is averaged and used for signal nor- 
malization of the other genes on the chip. To date, 
very few toxicants have been shown to have appre- 
ciable effects on the expression of these housekeep- 
ing genes. However, this housekeeping list will be 
revised if new data warrant the addition or deletion 
of a particular gene. Table 1 contains a general de- 
scription of some of the different classes of genes 
that comprise ToxChip v1.0.
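The housekeeping-based normalization described above amounts to dividing each spot intensity by the average intensity of the housekeeping spots. The sketch below shows that calculation under simple assumptions (background-corrected, positive intensities); the gene identifiers and values are hypothetical, and the actual ToxChip analysis software is not described in the text.

```python
from statistics import mean


def normalize_to_housekeeping(intensities, housekeeping_genes):
    """Divide every spot intensity by the mean housekeeping intensity."""
    reference = mean(intensities[g] for g in housekeeping_genes)
    if reference <= 0:
        raise ValueError("housekeeping reference must be positive")
    return {gene: value / reference for gene, value in intensities.items()}


# Hypothetical intensities: two housekeeping spots and one gene of interest.
raw = {"hk1": 900.0, "hk2": 1100.0, "target": 2500.0}
normalized = normalize_to_housekeeping(raw, ["hk1", "hk2"])
print(normalized["target"])
```

Averaging over many housekeeping genes makes the reference less sensitive to any single gene that a particular toxicant happens to affect, which is consistent with the plan to revise the list as new data arrive.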

When a toxicant signature is determined, the 
genes within this signature are flagged within the 
database. When uncharacterized toxicants are then 
screened, the data can be quickly reformatted so that 
blocks of genes representing the different signatures
are displayed [11]. This facilitates rapid, visual
interpretation of data. We are also developing
ToxChip v2.0 and chips for other model systems,
including rat, mouse, Xenopus, and yeast, for use in 
toxicology studies. 

Animal Models in Toxicology Testing 

The toxicology community relies heavily on the 
use of animals as model systems for toxicology test- 
ing. Unfortunately, these assays are inherently ex- 
pensive, require large numbers of animals and take a 
long time to complete and analyze. Therefore, the 
National Institute of Environmental Health Sciences 
(NIEHS), the National Toxicology Program, and the 
toxicology community at large are committed to re- 
ducing the number of animals used, by developing 
more efficient and alternative testing methodologies. 
Although substantial progress has been made in the 
development of alternative methods, bioassays are 
still used for testing endpoints such as neurotoxicity,
immunotoxicity, reproductive and developmental
toxicology, and genetic toxicology. The rodent
cancer bioassay is a particularly expensive and time- 
consuming assay, as it requires almost 4 yr, 1200 
animals, and millions of dollars to execute and analyze
[43]. In vitro experiments of the type outlined in
Figure 2 might provide evidence that an unknown agent
is (or is not) responsible for eliciting a given
biological response. This information would help to
select a bioassay more specifically suited to the agent
in question or perhaps suggest that a bioassay is not
necessary, which would dramatically reduce cost,
animal use, and time.

Table 1. ToxChip v1.0: A Human cDNA Microarray Chip
Designed to Detect Responses to Toxic Insult

Gene category                            No. of genes on chip
Apoptosis                                 72
DNA replication and repair                99
Oxidative stress/redox homeostasis        90
Peroxisome proliferator responsive        22
Dioxin/PAH responsive                     12
Estrogen responsive                       63
Housekeeping                              84
Oncogenes and tumor suppressor genes      76
Cell-cycle control                        51
Transcription factors                    131
Kinases                                  276
Phosphatases                              88
Heat-shock proteins                       23
Receptors                                349
Cytochrome P450s                          30

*This list is intended as a general guide. The gene categories are not
unique, and some genes are listed in multiple categories.

The addition of microarray techniques to stan- 
dard bioassays may dramatically enhance the sen- 
sitivity and interpretability of the bioassay and 
possibly reduce its cost. Gene-expression signatures 
could be determined for various types of tissue-spe- 
cific toxicants, and new compounds could be 
screened for these characteristic signatures, provid- 
ing a rapid and sensitive in vivo test. Also, because 
gene expression is often exquisitely sensitive to low 
doses of a toxicant, the combination of gene-expres- 
sion screening and the bioassay might allow the use 
of lower toxicant doses, which are more relevant to 
human exposure levels, and the use of fewer ani- 
mals. In addition, gene-expression changes are nor- 
mally measured in hours or days, not in the months 
to years required for tumor development. Further- 
more, microarrays might be particularly useful for 
investigating the relationship between acute and 
chronic toxicity and identifying secondary effects 
of a given toxicant by studying the relationship 
between the duration of exposure to a toxicant and 
the gene-expression profile produced. Thus, a bio- 
assay that incorporates gene-expression signatures 
with traditional endpoints might be substantially 
shorter, use more realistic dose regimens, and cost 
substantially less than the current assays do. 

These considerations are also relevant for branches 
of toxicology not related to human health and not 
using rodents as model systems, such as aquatic toxi- 
cology and plant pathology. Bioassays based on the 
fathead minnow, Daphnia, and Arabidopsis could
also be improved by the addition of microarray analysis.
The combination of microarrays with traditional
bioassays might also be useful for investigating some 
of the more intractable problems in toxicology re- 
search, such as the effects of complex mixtures and 
the difficulties in cross-species extrapolation. 

Exposure Assessment, Environmental Monitoring, 
and Drug Safety 

The currently used methods for assessment of ex- 
posure to chemical toxicants are based on measure- 
ment of tissue toxin levels or on surrogate markers 
of toxicity, termed biomarkers (e.g., peripheral blood 
levels of hepatic enzymes or DNA adducts). Because 
gene expression is a sensitive endpoint, gene expres- 
sion as measured with microarray technology may 
be useful as a new biomarker to more precisely iden- 
tify hazards and to assess exposure. Similarly, 
microarrays could be used in an environmental- 
monitoring capacity to measure the effect of poten- 
tial contaminants on the gene-expression profiles 
of resident organisms. In an analogous fashion, 
microarrays could be used to measure gene-expres- 
sion endpoints in subjects in clinical trials. The com- 
bination of these gene-expression data and more 
established toxic endpoints in these trials could be 
used to define highly precise surrogates of safety. 

Gene-expression profiles in samples from exposed 
individuals could be compared to the profiles of the 
same individuals before exposure. From this infor- 
mation, the nature of the toxic exposure can be de- 
termined or a relative clinical safety factor estimated. 
In the future it may also be possible to estimate not 
only the nature but the dose of the toxicant for a 
given exposure, based on relative gene-expression 
levels. This general approach may be particularly 
appropriate for occupational-health applications, in 
which unexposed and exposed samples from the 
same individuals may be obtainable. For example, 
a pilot study of gene expression in peripheral-blood 
lymphocytes of Polish coke-oven workers exposed 
to PAHs (and many other compounds) is under con- 
sideration at the NIEHS. An important consideration 
for these types of studies is that gene expression can 
be affected by numerous factors, including diet, 
health, and personal habits. To reduce the effects 
of these confounding factors, it may be necessary 
to compare pools of control samples with pools of 
treated samples. In the future it may be possible to 
compare exposed sample sets to a national database 
of human-expression data, thus eliminating the 
need to provide an unexposed sample from the same 
individual. Efforts to develop such a national gene- 
expression database are currently under way [44,45]. 
However, this national database approach will re- 
quire a better understanding of genome-wide gene 
expression across the highly diverse human popu- 
lation and of the effects of environmental factors 
on this expression. 
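The two comparison designs discussed above, paired samples from the same individuals and pooled control versus pooled exposed samples, can be sketched as follows. This is a simplified illustration: the intensities are hypothetical, and averaging measured values stands in for the physical pooling of samples that the text may intend.

```python
import math
from statistics import mean


def paired_log2_changes(before, after):
    """log2(after/before) for one gene, one value per individual."""
    return [math.log2(a / b) for b, a in zip(before, after)]


def pooled_log2_change(control_pool, exposed_pool):
    """log2 ratio of pool means, for when paired samples are unavailable."""
    return math.log2(mean(exposed_pool) / mean(control_pool))


# Hypothetical intensities for one gene in the same three individuals.
before = [100.0, 120.0, 80.0]
after = [200.0, 240.0, 160.0]
changes = paired_log2_changes(before, after)
print(changes, mean(changes))
print(pooled_log2_change(before, after))
```

The paired design removes each individual's baseline, which is why it is attractive where unexposed and exposed samples from the same person can be obtained; the pooled design trades that control for robustness against individual confounders such as diet or health.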
Alleles, Oligo Arrays, and Toxicogenetics

Gene sequences vary between individuals, and 
this variability can be a causative factor in human 
diseases of environmental origin [46,47]. A new area 
of toxicology, termed toxicogenetics, was recently 
developed to study the relationship between genetic 
variability and toxicant susceptibility. This field is 
not the subject of this discussion, but it is worth- 
while to note that the ability of oligonucleotide ar- 
rays to discriminate DNA molecules based on single 
base-pair differences makes these arrays uniquely 
useful for this type of analysis. Recent reports dem- 
onstrated the feasibility of this approach [41,42]. 
The NIEHS has initiated the Environmental Genome 
Project to identify common sequence polymor- 
phisms in 200 genes thought to be involved in en- 
vironmental diseases [48]. In a pilot study on the 
feasibility of this application to the Environmental 
Genome Project, oligonucleotide arrays will be used 
to resequence 20 candidate genes. This toxicogenetic 
approach promises to dramatically improve our un- 
derstanding of interindividual variability in disease 
susceptibility. 

FUTURE PRIORITIES 

There are many issues that must be addressed be- 
fore the full potential of microarrays in toxicology 
research can be realized. Among these are model sys- 
tem selection, dose selection, and the temporal na- 
ture of gene expression. In other words, in which 
species, at what dose, and at what time do we look 
for toxicant-induced gene expression? If human 
samples are analyzed, how variable is global gene 
expression between individuals, before and after toxi- 
cant exposure? What are the effects of age, diet, and 
other factors on this expression? Experience, in the 
form of large data sets of toxicant exposures, will 
answer these questions. 

One of the most pressing issues for array scientists 
is the construction of a national public database 
(linked to the existing public databases) to serve as a 
repository for gene-expression data. This relational 
database must be made available for public use, and 
researchers must be encouraged to submit their ex- 
pression data so that others may view and query the 
information. Researchers at the National Institutes 
of Health have made laudable progress in develop- 
ing the first generation of such a database [44,45]. In 
addition, improved statistical methods for gene clus- 
tering and pattern recognition are needed to ana- 
lyze the data in such a public database. 
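To make the idea of a queryable relational repository concrete, the sketch below builds a two-table layout in an in-memory SQLite database. The schema, column names, and sample rows are assumptions chosen for illustration and do not describe the database cited in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE experiment (
        id       INTEGER PRIMARY KEY,
        agent    TEXT NOT NULL,   -- toxicant or treatment name
        dose     TEXT,
        hours    INTEGER,         -- exposure time
        platform TEXT             -- array type used
    );
    CREATE TABLE measurement (
        experiment_id INTEGER REFERENCES experiment(id),
        gene          TEXT NOT NULL,
        log2_ratio    REAL NOT NULL
    );
    """
)

# Hypothetical submission: one experiment with three gene measurements.
conn.execute(
    "INSERT INTO experiment VALUES (1, 'agent_X', '10 uM', 24, 'cDNA array')"
)
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?)",
    [(1, "geneA", 1.8), (1, "geneB", -1.2), (1, "geneC", 0.1)],
)

# Query: genes changed at least two-fold in either direction.
rows = conn.execute(
    """
    SELECT e.agent, m.gene, m.log2_ratio
    FROM measurement m JOIN experiment e ON e.id = m.experiment_id
    WHERE ABS(m.log2_ratio) >= 1.0
    ORDER BY m.gene
    """
).fetchall()
print(rows)
```

Storing dose, time, and platform alongside each measurement is what would let other researchers filter and compare submissions made under different experimental conditions.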

The proliferation of different platforms and methods
for microarray hybridizations will improve sample
handling and data collection and analysis and reduce
costs. However, the variety of microarray methods
available will create problems of data compatibility
between platforms. In addition, the near-infinite
variety of experimental conditions under which data
will be collected by different laboratories will make
large-scale data analysis extremely difficult. To help
circumvent these future problems, a set of standards to
be included on all platforms should be established.
These standards would facilitate data entry into the
national database and serve as reference points for
cross-platform and inter-laboratory data analysis.

Many issues remain to be resolved, but it is clear 
that new molecular techniques such as microarray 
hybridization will have a dramatic impact on toxicol- 
ogy research. In the future, the information gathered 
from microarray-based hybridization experiments will 
form the basis for an improved method to assess the 
impact of chemicals on human and environmental 
health.

Abstract

Recent progress in genomics and proteomics technologies has created
[illegible] opportunity to significantly impact the pharmaceutical drug
development processes. The [illegible] whole organisms [illegible]
inducible responses to stimuli such as [illegible] molecular
fingerprints, indicative of a drug's efficacy and potential [illegible]
assays allowing one to profile treatment-re[illegible] mechanisms of
drug action and toxicity. The benefits will be [illegible] optimized
monitoring of drug efficacy and safety in pre-clinical and clinical
studies based [illegible]

1. Introduction

The majority of drugs act by binding to protein 
targets, most to known proteins representing en- 
zymes, receptors and channels, resulting in effects 
such as enzyme inhibition and impairment of 
signal transduction. The treatment-induced per- 
turbations provoke feedback reactions aiming to 
compensate for the stimulus, which almost always 
are associated with signals to the nucleus, result- 
ing in altered gene expression. Such gene expres- 
sion regulations account for both the 
pharmacological action and the toxicity of a drug
and can be visualized by either global mRNA or
global protein expression profiling. Hence, for 
each individual drug, a characteristic gene regula- 
tion pattern, its molecular fingerprint, exists 
which bears valuable information on its mode of 
action and its mechanism of toxicity. 

Gene expression is a multistep process that 
results in an active protein (Fig. 1). There exist
numerous regulation systems that exert control at 
and after the transcription and the translation 
step. Genomics, by definition, encompasses the 
quantitative analysis of transcripts at the mRNA 
level, while the aim of proteomics is to quantify 
gene expression further down-stream, creating a 
snapshot of gene regulation closer to ultimate cell 
function control. 
2. Global mRNA profiling

Expression data at the mRNA level can be produced
using a set of different technologies such as DNA
microarrays, reverse transcript imaging, amplified
fragment length polymorphism (AFLP), serial analysis
of gene expression (SAGE) and others. Currently, DNA
microarrays are very popular and promise a great
potential. On a typical array, each gene of interest is
represented either by a long DNA fragment (200-2400
bp) typically generated by polymerase chain reaction
(PCR) and spotted on a suitable substrate using
robotics (Schena et al., 1995; Shalon et al., 1996) or
by several short oligonucleotides (20-30 bp)
synthesized directly onto a solid support using
photolabile nucleotide chemistry (Fodor et al., 1991;
Chee et al., 1996). From control and treated tissues,
total RNA or mRNA is isolated and reverse transcribed
in the presence of radioactive or fluorescent labeled
nucleotides, and the labeled probes are then hybridized
to the arrays. The intensity of the array signal is
measured for each gene transcript by either
autoradiography or laser scanning confocal microscopy.
The ratio between the signals of control and treated
samples reflects the relative drug-induced change in
transcript abundance.

3. Global protein profiling

Global quantitative expression analysis at the protein
level is currently restricted to the use of
two-dimensional gel electrophoresis. This technique
combines separation of tissue proteins by isoelectric
focusing in the first dimension and by sodium dodecyl
sulfate slab gel electrophoresis-based molecular weight
separation in the second, orthogonal dimension
(Anderson et al., 1991). The product is a rectangular
pattern of protein spots that are typically revealed by
Coomassie Blue, silver or fluorescent staining (Fig. 2).
Protein spots are identified by mass spectrometry
following generation of peptide mass fingerprints
([citation illegible]) and sequence tags (Wilkins et
al., 1996). Similar to the mRNA approach, the optical
densities of spots from control and treated samples are
compared as a ratio to search for treatment-related
changes.
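Both the array signals and the gel spot densities described above are interpreted as ratios between control and treated samples. A minimal sketch of that calculation follows; the two-fold cutoff, the feature names, and the values are assumptions made for illustration rather than parameters taken from the text.

```python
def expression_ratios(control, treated):
    """Treated/control signal ratio for every feature measured in both samples."""
    return {
        key: treated[key] / control[key]
        for key in control
        if key in treated and control[key] > 0
    }


def changed_features(ratios, fold=2.0):
    """Features whose ratio is at least `fold` up or down."""
    return sorted(k for k, r in ratios.items() if r >= fold or r <= 1.0 / fold)


# Hypothetical signals (array intensities or spot optical densities).
control = {"spot1": 100.0, "spot2": 400.0, "spot3": 250.0}
treated = {"spot1": 300.0, "spot2": 100.0, "spot3": 260.0}
ratios = expression_ratios(control, treated)
print(ratios)
print(changed_features(ratios))
```

The same function applies to either data type because only the relative change is used; in practice the signals would first be normalized and background-corrected.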

4. Expression data analysis 

Bioinformatics forms a key element required to
organize, analyze and store expression data from either
source, the mRNA or the protein level. The overall
objective, once a mass of high-quality quantitative
expression data has been collected, is to visualize
complex patterns of gene expression changes, to detect
pathways and sets of genes tightly correlated with
treatment efficacy and toxicity, and to compare the
effects of different sets of treatment (Anderson et al.,
1996). As the drug effect database is growing, one may
detect similarities and differences between the
molecular fingerprints produced by various drugs,
information that may be crucial to make a decision
whether to refocus or extend the therapeutic spectrum of
a drug candidate.



5. Comparison of global mRNA and protein 
expression profiling 

There are several synergies and overlaps of data
obtained by mRNA and protein expression analysis.
Low-abundance transcripts may not be easily quantified
at the protein level using standard two-dimensional
gel electrophoresis analysis, and their detection may
require prefractionation of samples. The expression of
such genes may be preferably quantified at the mRNA
level using techniques allowing PCR-mediated target
amplification. Tissue biopsy samples typically yield
good quality of both mRNA and proteins; however, the
quality of mRNA isolated from body fluids is often poor
due to the faster degradation of mRNA when compared
with proteins. RNA samples from body fluids such as
serum or urine are often not very meaningful, and
secreted proteins are likely more reliable surrogate
markers for treatment efficacy and safety. Detection of
post-translational modifications, events often related
to function or nonfunction of a protein, is restricted
to protein expression analysis and rarely can be
predicted by mRNA profiling. Information on subcellular
localization and translocation of proteins has to be
acquired at the level of the protein in combination
with sample prefractionation procedures. The growing
evidence of a poor correlation between mRNA and protein
abundance (Anderson and Seilhamer, 1997) further
suggests that the two approaches, mRNA and protein
profiling, are complementary and should be applied in
parallel.
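The statement about mRNA and protein abundance refers to a correlation that can be computed directly when both are measured for the same genes. The sketch below implements the Pearson coefficient in plain Python; the abundance values are invented for illustration and are not data from the cited study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical relative abundances for four genes measured at both levels.
mrna = [1.0, 2.0, 3.0, 4.0]
protein = [2.0, 1.0, 4.0, 3.0]
print(pearson(mrna, protein))
```

A coefficient well below 1 for real paired data would support the argument that transcript levels alone do not predict protein levels, and hence that the two profiling approaches are complementary.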

6. Expression profiling and drug development 

Understanding the mechanisms of action and toxicity,
and being able to monitor treatment efficacy and safety
during trials, is crucial for the successful development
of a drug. Mechanistic insights are essential for the
interpretation of drug effects and enhance the chances
of recognizing potential species specificities
contributing to an improved risk profile in humans
(Richardson et al., 1993; Steiner et al., 1996b; Aicher
et al., 1998). The value of expression profiling further
increases when links between treatment-induced
expression profiles and specific pharmacological and
toxic endpoints are established (Anderson et al., 1991,
1995, 1996; Steiner et al., 1996a). Changes in gene
expression are known to precede the manifestation of
morphological alterations, giving expression profiling
a great potential for early compound screening, enabling
one to select drug candidates with wide therapeutic
windows reflected by molecular fingerprints indicative
of high pharmacological potency and low toxicity (Arce
et al., 1998). In later phases of drug development,
surrogate markers of treatment efficacy and toxicity can
be applied to optimize the monitoring of pre-clinical
and clinical studies (Doherty et al., 1998).



7. Perspectives 

The basic methodology of safety evaluation has changed
little during the past decades. Toxicity in laboratory
animals has been evaluated primarily by using
hematological, clinical chemistry and histological
parameters as indicators of organ damage. The rapid
progress in genomics and proteomics technologies creates
a unique opportunity to dramatically improve the
predictive power of safety assessment and to accelerate
the drug development process. Application of gene and
protein expression profiling promises to improve lead
selection, resulting in the development of drug
candidates with higher efficacy and lower toxicity. The
identification of biologically relevant surrogate
markers correlated with treatment efficacy and safety
bears a great potential to optimize the
monitoring of preclinical and clinical trials.

DNA array technology makes it possible to rapidly genotype individuals or quantify the expression
of thousands of genes on a single filter or glass slide, and holds enormous potential in toxicologic
applications. This potential led to a U.S. Environmental Protection Agency-sponsored workshop
[...] microarrays to toxicologic research and risk assessment include genome-wide expression analyses to
identify gene-expression networks and toxicant-specific signatures that can be used to define mode
of action, for exposure assessment, and for environmental monitoring. Arrays may also prove useful
for monitoring genetic variability and its relationship to toxicant susceptibility in human
populations. Key words: DNA arrays, gene arrays, microarrays, toxicology. Environ Health Perspect
107:681-685 (1999). [Online 6 July 1999]
http://etn>rtetl.niefanib.gav/a^ 



Decoding the genetic blueprint is a dream that 
offers manifold returns in terms of understand- 
ing how organisms develop and function in an 
often hostile environment. With the rapid 
advances in molecular biology over the last 30 
years, the dream has come a step closer to reali- 
ty. Molecular biologists now have the ability to 
elucidate the composition of any genome. 
Indeed, almost 20 genomes have already been 
sequenced and more than 60 are currently 
under way. Foremost among these is the 
Human Genome Mapping Project. However, 
the genomes of a number of commonly used 
laboratory species are also under intensive 
investigation, including yeast, Arabidopsis, 
maize, rice, zebra fish, mouse, rat, and dog. It 
is widely expected that the completion of such 
programs will facilitate the development of 
many powerful new techniques and approach- 
es to diagnosing and treating genetically and 
environmentally induced diseases which afflict 
mankind. However, the vast amount of data 
being generated by genome mapping will 
require new high-throughput technologies to 
investigate the function of the millions of new 
genes that are being reported. Among the most
widely heralded of the new functional 
genomics technologies are DNA arrays, which 
represent perhaps the most anticipated new 
molecular biology technique since polymerase 
chain reaction (PCR). 

Arrays enable the study of literally thou- 
sands of genes in a single experiment. The 
potential importance of arrays is enormous and 
has been highlighted by the recent publication 
of an entire Nature Genetics supplement dedi- 
cated to the technology (2). Despite this huge 
surge of interest, DNA arrays are still little used 
and largely unproven, as demonstrated by the 
high ratio of review and press articles to actual 
data papers. Even so, the potential they offer
has driven venture capitalists into a frenzy of
investment and many new companies are 
springing up to claim a share of this rapidly 
developing market. 

The U.S. Environmental Protection 
Agency (EPA) is interested in applying DNA 
array technology to ongoing toxicologic stud- 
ies. To learn more about the current state of 
the technology, the Reproductive Toxicology 
Division (RTD) of the National Health and 
Environmental Effects Research Laboratory 
(NHEERL; Research Triangle Park, NC) 
hosted a workshop on "Application of 
Microarrays to Toxicology" on 7-8 January 
1999 in Research Triangle Park, North 
Carolina. The workshop was organized by 
David Dix, Robert Kavlock, and John Rockett 
of the RTD/NHEERL. Twenty-two intra- 
mural and extramural scientists from govern- 
ment, academia, and industry shared informa- 
tion, data, and opinions on the current and 
future applications for this exciting new tech- 
nology. The workshop had more than 150 
attendees, including researchers, students, and 
administrators from the EPA, the National 
Institute of Environmental Health Sciences 
(NIEHS), and a number of other establish- 
ments from Research Triangle Park and 
beyond. Presentations ranged from the tech- 
nology behind array production through the 
sharing of actual experimental data and projec- 
tions on the future importance and applica- 
tions of arrays. The information contained in 
the workshop presentations should provide aid 
and insight into arrays in general and their 
application to toxicology in particular. 

Array Elements 

In the context of molecular biology, the word 
"array" is normally used to refer to a series of 
DNA or protein elements firmly attached in
a regular pattern to some kind of supportive
medium. DNA array is often used inter- 
changeably with gene array or microarray. 
Although not formally defined, microarray is 
generally used to describe the higher density 
arrays typically printed on glass chips. The 
DNA elements that make up DNA arrays 
can be oligonucleotides, partial gene 
sequences, or full-length cDNAs. Companies 
offering pre-made arrays that contain less 
than full-length clones normally use regions 
of the genes which are specific to that gene to 
prevent false positives arising through cross- 
hybridization. Sequence verification of 
cDNA clone identity is necessary because of 
errors in identifying specific clones from 
cDNA libraries and databases. Premade 
DNA arrays printed on membranes are cur- 
rently or imminently available for human, 
mouse, and rat. In most cases they contain 
DNA sequences representing several thousand 
different sequence clusters or genes as 
delineated through the National Center for 
Biotechnology Information UniGene Project 
(2). Many of these different UniGene clusters 
(putative genes) are represented only by 
expressed sequence tags (ESTs). 

Array Printing 

Arrays are typically printed on one of two 
types of support matrix. Nylon membranes 
are used by most off-the-shelf array providers 
such as Clontech Laboratories, Inc. 
(Palo Alto, CA), Genome Systems, Inc. (St. 
Louis, MO), and Research Genetics, Inc. 
(Huntsville, AL). Microarrays such as those 
produced by Affymetrix, Inc. (Santa Clara, 
CA), Incyte Pharmaceuticals, Inc. (Palo Alto, 
CA), and many do-it-yourself (DIY) arraying 
groups use glass wafers or slides. Although 
standard microscope slides may be used, they 
must be pretreated to help the DNA adhere 
to the glass. Several different 
coatings have been successfully used, including 
silane and lysine. The coating of slides 
can easily be carried out in the laboratory, 
but many prefer the convenience of precoated 
slides available from suppliers. 

Once the support matrix has been pre- 
pared, the DNA elements can be applied by 
several methods. Affymetrix, Inc., has developed 
a unique photolithographic technology 
for attaching oligonucleotides to glass wafers. 
More commonly, DNA is applied by either 
noncontact or contact printing. Noncontact 
printers can use thermal, solenoid, or piezoelec- 
tric technology to spray aliquots of solution 
onto the support matrix and may be used to 
produce slide or membrane-based arrays. 
Cartesian Technologies, Inc. (Irvine, CA) has 
developed nQUAD technology for use in its 
PixSys printers. The system couples a syringe 
pump with a microsolenoid valve, a combination 
that provides rapid quantitative dispensing 
of nanoliter volumes (down to 42 nL) over 
a variable volume range. A different approach 
to noncontact printing uses a solid pin and ring 
combination (Genetic MicroSystems, Inc., 
Woburn, MA). This system (Figure 1) allows a 
broader range of sample, including cell suspen- 
sions and particulates, because the printing 
head cannot be blocked up in the same way as 
a spray nozzle. Fluid transfer is controlled in 
this system primarily by the pin dimensions 
and the force of deposition, although the 
nature of the support matrix and the sample 
will also affect transfer to some degree. 

In contact printing, the pin head is dipped 
in the sample and then touched to the support 
matrix to deposit a small aliquot. Split pins 
were one of the first contact-printing devices 
to be reported and are the suggested format 
for DIY arrayers, as described by Brown (3). 
Split pins are small metal pins with a precise 
groove cut vertically in the middle of the pin 
tip. In this system, 1-48 split pins are posi- 
tioned in the pin-head. The split pins work by 
simple capillary action, not unlike a fountain 
pen — when the pin heads are dipped in the 
sample, liquid is drawn into the pin groove. A 
small (fixed) volume is then deposited each 
time the split pins are gently touched to 
the support matrix. Sample (100-500 pL 
depending on a variety of parameters) can be 
deposited on multiple slides before refilling is 
required, and array densities of > 2,500 
spots/cm² may be produced. The deposit volume 
depends on the split size, sample fluidity, 
and the speed of printing. Split pins are 
relatively simple to produce and can be made 
in-house if a suitable machine shop is avail- 
able. Alternatively, they can be obtained 
directly from companies such as TeleChem 
International, Inc. (Sunnyvale, CA). 
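
As a rough check on the density figure quoted above, the following sketch converts a spot density into a center-to-center spacing. It assumes the spots sit on a square grid, which the text does not state, and the function name is purely illustrative.

```python
import math


def spot_pitch_um(spots_per_cm2: float) -> float:
    """Center-to-center spacing in micrometers for a square grid of spots."""
    if spots_per_cm2 <= 0:
        raise ValueError("density must be positive")
    spots_per_cm = math.sqrt(spots_per_cm2)  # spots along one 1-cm edge
    return 10_000.0 / spots_per_cm           # 1 cm = 10,000 micrometers


# 2,500 spots/cm^2 corresponds to 50 spots per centimeter on a square grid.
pitch = spot_pitch_um(2500)
print(f"pitch = {pitch:.0f} micrometers")
```

Under that assumption, a density of 2,500 spots/cm² implies a spacing of about 200 micrometers between spot centers, which bounds the usable spot diameter.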

Irrespective of their source, printers 
should be run through a preprint sequence 
prior to producing the actual experimental 



arrays; the first 100 or so spots of a new run 
tend to be somewhat variable. Factors affecting 
spot reproducibility include slide treatment 
homogeneity, sample differences, and 
instrument errors. Other factors that come 
into play include clean ejection of the drop 
and clogging (nQUAD printing) and 
mechanical variations and long-term alter- 
ation in print-head surface of solid and split 
pins. However, with careful preparation it is 
possible to get a coefficient of variation for 
spot reproducibility below 10%. 
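
The reproducibility target mentioned above can be checked with a few lines of code. This is a minimal sketch that assumes replicate spots of the same sample have already been quantified; the intensity values are hypothetical and the function name is not taken from any scanner software.

```python
from statistics import mean, stdev


def coefficient_of_variation(intensities):
    """Sample standard deviation divided by the mean, as a percentage."""
    if len(intensities) < 2:
        raise ValueError("need at least two replicate spots")
    m = mean(intensities)
    if m == 0:
        raise ValueError("mean intensity is zero")
    return 100.0 * stdev(intensities) / m


# Hypothetical replicate spot intensities (arbitrary units) from one print run.
replicates = [980, 1020, 1005, 995, 1000]
cv = coefficient_of_variation(replicates)
print(f"CV = {cv:.1f}% (target from the text: below 10%)")
```

Running the same calculation on the first spots of a new run and on later spots would show whether the preprint sequence has stabilized the printer.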

One potential printing problem is sample 
carryover. Repeated washing, blotting, and 
drying (vacuum) of print pins between samples 
is normally effective at reducing sample carry- 
over to negligible amounts. Printing should 
also be carried out in a controlled environ- 
ment. Humidified chambers are available in 
which to place printers. These help prevent 
dust contamination and produce a uniform 
drying rate, which is important in determining 
spot size, quality, and reproducibility. 

In summary, although several printing 
technologies are available, none is clearly 
superior, and all are still at a relatively 
early stage of development. 

Array Hybridization 

The hybridization protocol is, practically 
speaking, relatively straightforward and those 
with previous experience in blotting should 
have little difficulty. Array hybridizations 
are, in essence, reverse Southern/Northern 
blots — instead of applying a labeled probe to 
the target population of DNA/RNA, the 
labeled population is applied to the probe(s). 
With membrane-based arrays, the control and 
treated mRNA populations are normally converted 
to cDNA and labeled with isotope (e.g., ³³P) in 
the process. These labeled populations are then 
hybridized independently to parallel or serial 
arrays and the hybridization signal is detected 
with a phosphorimager. A less commonly used 
alternative to radioactive probes is enzymatic 
detection. The probe may be biotinylated, 
haptenylated, or have alkaline 
phosphatase/horseradish peroxidase attached. 
Hybridization is detected by an enzymatic 
reaction yielding a color change (4). Differences 
in hybridization signals can be detected by eye 
or, more accurately, with the help of digital 
imaging and commercially available software. 
The labeling of the test populations for slide- 
based microarrays uses a slightly different 
approach. The probe typically consists of two 
samples of poly(A)+ RNA (usually from a treated 
and a control population) that are converted to 
cDNA; in the process each is labeled with a 
different fluor. The independently labeled 
probes are then mixed together and hybridized 
to a single microarray slide and the resulting 
combined fluorescent signal is scanned. After 
normalization, it is possible to determine the 
ratio of fluorescent signals from a single 
hybridization of a slide-based microarray. 

Figure 1. Genetic MicroSystems (Woburn, MA) pin 
ring system for printing arrays. The pin ring 
combination consists of a circular open ring oriented 
parallel to the sample solution, with a vertical pin 
centered over the ring. When the ring is dipped 
into a solution and lifted, it withdraws an aliquot 
of sample held by surface tension. To spot the 
sample, the pin is driven down through the ring 
and a portion of the solution is transferred to the 
bottom of the pin. The pin continues to move 
downward until the pendant drop of solution 
makes contact with the underlying surface. The 
pin is then lifted, and gravity and surface tension 
cause deposition of the spot onto the array. 
Figure from Flowers et al. (14), with permission 
from Genetic MicroSystems. 
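
The two-color ratio calculation described in the hybridization paragraph above can be sketched as follows. The text does not say which normalization was used, so this example assumes a simple global approach in which the median log ratio across all spots is set to zero; the intensities and the function name are hypothetical.

```python
import math
from statistics import median


def normalized_log_ratios(cy5, cy3):
    """Return log2(Cy5/Cy3) per spot, shifted so the median log ratio is 0."""
    if len(cy5) != len(cy3):
        raise ValueError("channel lengths differ")
    raw = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    shift = median(raw)  # assumed global normalization constant
    return [x - shift for x in raw]


# Hypothetical background-subtracted intensities for five spots.
cy5 = [2000.0, 1000.0, 4000.0, 1000.0, 500.0]  # treated sample
cy3 = [1000.0, 500.0, 500.0, 500.0, 250.0]     # control sample
ratios = normalized_log_ratios(cy5, cy3)
print(ratios)
```

In this made-up data the treated channel is uniformly twice as bright, so after normalization only the third spot shows a real difference (a log2 ratio of 2, or fourfold).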

cDNA derived from control and treated 
populations of RNA is most commonly 
hybridized to arrays, although subtractive 
hybridization or differential display reactions 
may also be used. Fluorophore- or radiola- 
beled nucleotides are directly incorporated 
into the cDNA in the process of converting 
RNA to cDNA. Alternatively, 5' end-labeled 
primers may be used for cDNA synthesis. 
These are labeled with a fluorophore for 
direct visualization of the hybridized array. 
Instead of a fluorophore, biotin or a hapten may be 
attached to the primer, in which case fluor-labeled 
streptavidin or antibody must be 
applied before a signal can be generated. The 
most commonly used fluorophores at present 
are cyanine (Cy)3 and Cy5 (Amersham 
Pharmacia Biotech AB, Uppsala, Sweden). 
However, the relative expense of these fluo- 
rescent conjugates has driven a search for 
cheaper alternatives. Fluorescein, rhodamine, 
and Texas red have all been used, and 
companies such as Molecular Probes, Inc. 
(Eugene, OR) are developing a series of 
labeled nucleotides with a wide range of exci- 
tation and emission spectra which may prove 
to function as well as the Cy dyes. 






Volume 107, Number 8, August 1999 • Environmental Health Perspectives 

Workshop Summary • Application of DNA arrays to toxicology 



Table 1. Advantages and disadvantages of different microarray scanning systems. 

Scanner type                Advantages                          Disadvantages 
--------------------------  ----------------------------------  ------------------------------------- 
CCD camera system           Few moving parts                    Less appropriate for dim samples 
                            Fast scanning of bright samples     Optical scatter can limit performance 

Nonconfocal laser scanner   Relatively simple optics            Low light collection efficiency 
                                                                Background artifacts not rejected 
                                                                Resolution typically low 

Confocal laser scanner      Small depth of focus reduces        Small depth of focus requires 
                            artifacts                           scanning precision 
                            May have high light collection 
                            efficiency 



Analysis of DNA Microarrays 

Membrane-based arrays are normally analyzed 
on film or with a phosphorimager, whereas 
chip-based arrays require more specialized scan- 
ning devices. These can be divided into three 
main groups: the charge-coupled device camera 
systems, the nonconfocal laser scanners, and the 
confocal laser scanners. The advantages and dis- 
advantages of each system are listed in Table 1. 

Because a typical spot on a microarray can 
contain > 10⁸ molecules, it is clear that a large 
variation in signal strength may occur. 
Current scanners cannot work across this 
many orders of magnitude (4 or 5 is more typ- 
ical). However, the scanning parameters can 
normally be adjusted to collect more or less 
signal, such that two or three scans of the same 
array should permit the detection of rare and 
abundant genes. 
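
One way to combine two scans of the same array, as suggested above, is sketched below. It assumes a 16-bit detector that saturates at 65,535 counts and a known sensitivity ratio between the two scan settings; both are illustrative assumptions, as are the intensity values and the function name.

```python
SATURATION = 65535  # assumed 16-bit detector ceiling


def merge_scans(high_gain, low_gain, gain_ratio):
    """Combine two scans of the same array into one intensity list.

    The high-gain reading is used unless it is saturated, in which case
    the low-gain reading is rescaled by gain_ratio (high/low sensitivity).
    """
    if len(high_gain) != len(low_gain):
        raise ValueError("scans must cover the same spots")
    merged = []
    for hi, lo in zip(high_gain, low_gain):
        merged.append(float(hi) if hi < SATURATION else lo * gain_ratio)
    return merged


# Hypothetical spots: a rare message, an abundant (saturated) one, a moderate one.
high = [120, 65535, 4300]
low = [12, 20000, 430]
merged = merge_scans(high, low, gain_ratio=10.0)
print(merged)
```

The rare message is taken from the sensitive scan and the abundant one from the less sensitive scan, which extends the usable range beyond what a single scan covers.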

When a microarray is scanned, the fluores- 
cent images are captured by software normally 
included with the scanner. Several commercial 
suppliers provide additional software for quan- 
tifying array images, but the software tools are 
constantly evolving to meet the developing 
needs of researchers, and it is prudent to 
define one's own needs and clarify the exact 
capabilities of the software before its purchase. 
Issues that should be considered include the 
following: 

• Can the software locate offset spots? 

• Can it quantitate across irregular hybridization 
signals? 

• Can the arrayed genes be programmed in for 
easy identification and location? 

• Can the software connect via the Internet to 
databases containing further information on 
the gene(s) of interest? 

One of the key issues raised at the work- 
shop was the sensitivity of microarray technol- 
ogy. Experiments by General Scanning, Inc. 
(Watertown, MA), have shown that by using 
the Cy dyes and their scanner, signal can be 
detected down to levels of < 1 fluor molecule 
per square micrometer, which translates to 
detecting a rare message at approximately one 
copy per cell or less. 

Array Applications 

Although arrays are an emerging technology 
certain to undergo improvement and 
alteration, they have already been applied usefully 
to a number of model systems. Arrays are 
at their most powerful when they contain the 
entire genome of the species they are being 
used to study. For this reason, they have strong 
support among researchers utilizing yeast and 
Caenorhabditis elegans. The genomes of 
both of these species have been sequenced and, 
in the case of yeast, deposited onto arrays for 
examination of gene expression (6,7). With 
both of these species, it is relatively easy to 
perturb individual gene expression. Indeed, 
C. elegans knockouts can be made simply by 
soaking the worms in an antisense solution of 
the gene to be knocked out. 

Table 1 notes: CCD, charge-coupled device. 
From Kawasaki (13). 

By a process of systematic gene disrup- 
tion, it is now possible to examine the cause 
and effect relationships between different 
genes in these simple organisms. This kind of 
approach should help elucidate biochemical 
pathways and genetic control processes, 
deconvolute polygenic interactions, and 
define the architecture of the cellular network. 
A simple case study of how this can be 
achieved was presented by Butow [University 
of Texas Southwestern Medical Center, 
Dallas, TX (Figure 2)]. Although it is the 
phenotypic result of a single gene knockout 
that is being examined, the effect of such 
perturbation will almost always be polygenic. 
Polygenic interactions will become increasingly 
important as researchers begin to move 
away from single gene systems when examin- 
ing the nature of toxicologic responses to 
external stimuli. This is especially important 
in toxicology because the phenotype pro- 
duced by a given environmental insult is 
never the result of the action of a single gene; 
rather, it is a complex interaction of one or 
multiple cellular pathways. Phenomena such 
as quantitative trait (the continuous variation 
of phenotype), epistasis (the effect of alleles of 
one or more genes on the expression of other 
genes), and penetrance (proportion of indi- 
viduals of a given genotype that display a par- 
ticular phenotype) will become increasingly 
evident and important as toxicologists push 
toward the ultimate goal of matching the 
responses of individuals to different 
environmental stimuli. 

Analysis of the transcriptome (the expres- 
sion level of all the genes in a given cell popula- 
tion) was a use of arrays addressed by several 
speakers. Unfortunately, current gene nomenclature 
is often confusing in that single genes 
are allocated multiple names (usually as a result 
of independent discovery by different laborato- 
ries), and there was a call for standardization of 
gene nomenclature. Nevertheless, once a tran- 
scriptome has been assembled it can then be 
transferred onto arrays and used to screen any 
chosen system. The EPA MicroArray 
Consortium (EPAMAC) is assembling testes 
transcriptomes for human, rat, and mouse. In a 
slightly different approach, Nuwaysir et al. (8) 
describe how the NIEHS assembled what is 
effectively a "toxicological transcriptome": a 
library of human and mouse genes that have 
previously been proven or implicated in 
responses to toxicologic insults. Clontech 
Laboratories, Inc. (Palo Alto, CA), has begun a 
similar process by developing stress/ toxicology 
filter arrays of rat, mouse, and human genes. 
Thus, rather than being tissue or cell specific, 
these stress/toxicology arrays can be used across 
a variety of model systems to look for alter- 
ations in the expression of toxicologically 
important genes and define the new field of 
toxicogenomics. The potential to identify toxi- 
cant families based on tissue- or cell-specific 
gene expression could revolutionize drug test- 
ing. These molecular signatures or fingerprints 
could not only point to the possible 
toxicity/carcinogenicity of newly discovered 
compounds (Figure 3), but also aid in elucidat- 
ing their mechanism of action through identifi- 
cation of gene expression networks. By exten- 
sion, such signatures could provide easily iden- 
tifiable biomarkers to assess the degree, time, 
and nature of exposure. 
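
The idea of matching a new compound against known toxicant signatures can be illustrated with a short sketch. It assumes each signature is a vector of log expression ratios over the same genes and uses Pearson correlation with an arbitrary 0.9 cutoff; the family names other than peroxisome proliferators, the numbers, and the cutoff are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def best_match(profile, signatures, threshold=0.9):
    """Return (family, r) for the best-correlated signature, or (None, r)."""
    scored = [(name, pearson(profile, sig)) for name, sig in signatures.items()]
    family, r = max(scored, key=lambda item: item[1])
    return (family if r >= threshold else None, r)


# Hypothetical log-ratio signatures over the same five genes.
signatures = {
    "peroxisome proliferators": [2.0, 1.5, -1.0, 0.0, 0.5],
    "hypothetical family B": [-1.0, 0.0, 2.0, 1.0, -0.5],
}
compound_1 = [1.9, 1.4, -1.1, 0.1, 0.6]
compound_2 = [0.1, -0.2, 0.0, 0.3, -0.1]
print(best_match(compound_1, signatures))
print(best_match(compound_2, signatures))
```

Here the first compound is assigned to the peroxisome proliferator family and the second matches nothing. A real screen would need many more genes and a cutoff justified by replicate data.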

DNA arrays are primarily a tool for exam- 
ining differential gene expression in a given 
model. In this context they are referred to as 
closed systems because they lack the ability of 
other differential expression technologies, e.g., 
differential display and subtractive hybridization, 
to detect previously unknown genes not 
present on the array. This would appear to 
limit the power of DNA arrays to the imagina- 
tions and preconceptions of the researcher in 
selecting genes previously characterized and 
thought to be involved in the model system. 
However, the various genome sequencing pro- 
jects have created a new category of 
sequence — the EST — that has partially molli- 
fied this deficiency. ESTs are cDNAs expressed 
in a given tissue that, although they may share 
some degree of sequence similarity to previous- 
ly characterized genes, have not been assigned 
specific genetic identity. By incorporating EST 
clones into an array, it is possible to monitor 
the expression of these unknown genes. This 
can enable the identification of previously 
uncharacterized genes that may have biologic 
significance in the model system. Filter arrays 
from Research Genetics and slide arrays from 
Incyte Pharmaceuticals both incorporate large 
numbers of ESTs from a variety of species. 

A further use of microarrays is the identification 
of single nucleotide polymorphisms 
(SNPs). These genomic variations are abundant 
(they occur approximately every 1 kb or 
so) and are the basis of restriction fragment 
length polymorphism analysis used in forensic 
analysis. Affymetrix, Inc. designed chips that 
contain multiple repeats of the same gene 
sequence. Each position is present with all four 
possible bases. After the hybridization of the 
sample, the degree of hybridization to the dif- 
ferent sequences can be measured and the exact 
sequence of the target gene deduced. SNPs are 
thought to be of vital importance in drug 
metabolism and toxicology. For example, sin- 
gle base differences in the regulatory region or 
active site of some genes can account for huge 
differences in the activity of that gene. Such 
SNPs are thought to explain why some people 
are able to metabolize certain xenobiotics bet- 
ter than others. Thus, arrays provide a further 
tool for the toxicologist investigating the 
nature of susceptible subpopulations and toxi- 
cologic response. 
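
The sequence-deduction step described above, in which each position is represented with all four possible bases, can be sketched in a simplified form. This example assumes one hybridization intensity per base per position and calls the brightest base; commercial chips use more elaborate probe designs and calling rules, and all values and names here are hypothetical.

```python
BASES = "ACGT"


def call_sequence(intensities):
    """Call the brightest base at each position.

    intensities is a list with one dict per position, mapping each of
    the four bases to its hybridization signal.
    """
    return "".join(max(BASES, key=lambda b: pos[b]) for pos in intensities)


def find_variants(called, reference):
    """List (position, reference_base, called_base) wherever the two differ."""
    return [
        (i, ref, alt)
        for i, (ref, alt) in enumerate(zip(reference, called))
        if ref != alt
    ]


# Hypothetical signals for a four-base stretch whose reference is ACGT.
signals = [
    {"A": 950, "C": 40, "G": 35, "T": 50},
    {"A": 30, "C": 900, "G": 45, "T": 60},
    {"A": 25, "C": 55, "G": 70, "T": 880},  # sample carries T instead of G
    {"A": 45, "C": 35, "G": 50, "T": 910},
]
called = call_sequence(signals)
variants = find_variants(called, "ACGT")
print(called, variants)
```

The third position is reported as a candidate single nucleotide difference because the probe carrying T gives the strongest signal where the reference has G.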

There are still many wrinkles to be ironed 
out before arrays become a standard tool for 
toxicologists. The main issues raised at the 
workshop by those with hands-on experience 
were the following: 

• Expense: the cost of purchasing/contracting 
this technology is still too great for many 
individual laboratories. 




Figure 2. Potential effects of gene knockout within 
positively and negatively regulated gene expression 
networks. [illegible] is limiting in wild type for expression of 
[illegible]. (A) A simple, two-component, linear regulatory 
network operating on gene [illegible], where [illegible] is a positive 
effector of [illegible] and jn is either a positive or negative 
effector of [illegible]. This network could be deduced by 
examining the consequence of (B) deleting jn on the 
expression of [illegible] and [illegible], where the expression of [illegible] 
would be decreased or increased depending on 
whether jn was a positive or negative regulator. 
These and other connected components of even 
greater complexity could be revealed by genome-wide 
expression analysis. From Butow (15). 



• Clones: the logistics of identifying, obtaining, 
and maintaining a set of nonredundant, noncontaminated, 
sequence-verified, species/cell/tissue/field-specific clones. 

• Use of inbred strains: where whole-organism 
models are being used, the use of inbred 
strains is important to reduce the potentially 
confusing effects of the individual variation 
typically seen in outbred populations. 

• Probe: the need for relatively large amounts 
of RNA, which limits the type of sample 
(e.g., biopsy) that can be used. Also, different 
RNA extraction methods can give different 
results. 

• Specificity: the ability to discriminate accurately 
between closely related genes (e.g., the 
cytochrome P450 family) and splice variants. 

• Quantitation: the quantitation of gene 
expression using gene arrays is still open to 
debate. One reason for this is the different 
incorporation of the labeling dyes. However, 
the main difficulty lies in knowing what to 
normalize against. One option is to include a 
large number of so-called housekeeping genes 
in the array. However, the expression of these 
genes often changes depending on the tissue 
and the toxicant, so it is necessary to characterize 
the expression of these genes in the 
model system before utilizing them. This is 
clearly not a viable option when screening 
multiple new compounds. A second option 
is to include on the array genes from a nonrelated 
species (e.g., a plant gene on an animal 
array) and to spike the probe with synthetic 
RNA(s) complementary to the gene(s). 

• Reproducibility: this is sometimes questionable, 
and a figure of approximately two or 
three repeats was used as the minimum number 
required to confirm initial findings. 
Again, however, most people advocated the 
use of Northern blots or reverse transcriptase 
PCR to confirm findings. 

• Sensitivity: concerns were voiced about the 
number of target molecules that must be pre- 
sent in a sample for them to be detected on 
the array. 

• Efficiency: reproducible identification of 1.5- 
to 2-fold differences in expression was report- 
ed, although the number of genes that 
undergo this level of change and remain 
undetected is open to debate. It is important 
that this level of detection be ultimately 
achieved because it is commonly perceived 
that some important transcription factors 
and their regulators respond at such low lev- 
els. In most cases, 3- to 5-fold was the mini- 
mum change that most were happy to 
accept. 

• Bioinformatics: perhaps the greatest concern 
was how to interpret the data with 
the greatest accuracy and efficiency. The 
biggest headache is trying to identify net- 
works of gene expression that are common to 
different treatments or doses. The amount of 
data from a single experiment is huge. It may 
be that, in the future, several groups individ- 
ually equipped with specialized software algo- 
rithms for studying their favorite genes or 
gene systems will be able to share the same 
hybridized chips. Thus, arrays could usher in 
a new perspective on collaboration and the 
sharing of data. 
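
Two of the issues listed above, normalizing against spiked-in control RNA and accepting only changes above a minimum fold difference, can be combined in a short sketch. It assumes the spike controls were added in equal amounts to both samples, so their ratio should be 1 after scaling, and uses the 3-fold cutoff quoted in the list. Gene names, intensities, and function names are hypothetical.

```python
def normalize_to_spikes(treated, control, spike_ids):
    """Scale treated intensities so the spike-in controls average a ratio of 1."""
    ratios = [treated[g] / control[g] for g in spike_ids]
    factor = sum(ratios) / len(ratios)
    return {gene: value / factor for gene, value in treated.items()}


def changed_genes(treated, control, spike_ids, min_fold=3.0):
    """Return {gene: fold} for genes up or down by at least min_fold."""
    scaled = normalize_to_spikes(treated, control, spike_ids)
    hits = {}
    for gene, value in scaled.items():
        if gene in spike_ids:
            continue  # controls are used for scaling only
        fold = value / control[gene]
        if fold >= min_fold or fold <= 1.0 / min_fold:
            hits[gene] = fold
    return hits


# Hypothetical intensities; the treated channel is uniformly twice as bright.
control = {"spike1": 1000, "spike2": 2000, "geneA": 500, "geneB": 800, "geneC": 900}
treated = {"spike1": 2000, "spike2": 4000, "geneA": 4000, "geneB": 1600, "geneC": 450}
hits = changed_genes(treated, control, {"spike1", "spike2"})
print(hits)
```

Without the spike-based scaling, geneB would appear to be induced twofold. After scaling it is unchanged, while geneA (4-fold up) and geneC (4-fold down) pass the cutoff.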

Figure 3. Gene expression profiles, also called fingerprints or signatures, of known toxicants or toxicant 
families may, in the future, be used to identify the potential toxicity of new drugs, etc. In this example, 
the genetic signature of test compound 1 is identical to that of known peroxisome proliferators, 
whereas that of test compound 2 does not match any known toxicant family. Based on these results, test 
compound 2 would be retained for further testing and test compound 1 would be eliminated. 

EPAMAC 

Perhaps the main reason most scientists are 
unable to use array technology is the high cost 
involved, whether buying off-the-shelf membranes, 
using contract printing services, or 
producing chips in-house. In view of this, 
researchers at the RTD/NHEERL initiated 
the EPAMAC. This consortium brings 
together scientists from the EPA and a num- 
ber of extramural labs with the aim of devel- 
oping microarray capability through the shar- 
ing of resources and data. EPAMAC 
researchers are primarily interested in the 
developmental and toxicologic changes seen 
in testicular and breast tissue, and a portion 
of the workshop was set aside for EPAMAC 
members to share their ideas on how the 
experimental application of microarrays could 
facilitate their research. One of the central 
areas of interest to EPAMAC members is the 
effect of xenobiotics on male fertility and 
reproductive health. Of greatest concern is 
the effect of exposure during critical periods 
of development and germ cell differentiation 
(9), and how this may compromise sperm 
counts and quality following sexual maturation 
(10). As well as spermatogenic tissue, 
there is also interest in how residual mRNA 
found in mature sperm (11) could be used as 
an indicator of previous xenobiotic effects (it 
is easier to obtain a semen sample than a tes- 
ticular biopsy). Arrays will be used to examine 
and compare the effect of exposure to heat 
and chemicals in testicular and epididymal 
gene expression profiles, with the aim of 
establishing relationships/associations 
between changes in developmental landmarks 
and the effects on sperm count and quality. 
Cluster, pattern, and other analysis of such 
data should help identify hidden relationships 
between genes that may reveal potential 
mechanisms of action and uncover roles for 
genes with unknown functions. 

Summary 

The full impact of DNA arrays may not be 
seen for several years, but the turnout at 
this regional workshop indicates the high level 
of interest that they foster. Apart from educat- 
ing and advertising the various technologies in 
this field, this workshop brought together a 
number of researchers from the Research 
Triangle Park area who are already using DNA 
arrays. The interest in sharing ideas and experi- 
ences led to the initiation of a Triangle array 
user's group. 

Array technology is still in its infancy. This 
means that the hardware is still improving and 
there is no current consensus for standard procedures, 
quantitation, and interpretation. 
Consistency in spotting and scanning arrays is 
not yet optimized, and this is one of the most 
critical requirements of any experiment. In 
addition, one of the dark regions of array technology, 
strife in the courts over who owns 
what portions of it, has further muddled the 
future and is a potential barrier toward the 
development of consensus procedures. 

Perhaps the greatest hurdle for the application 
of arrays is the actual interpretation of 
data. No specialists in bioinformatics attended 
the workshop, largely because they are rare and 
because as yet no one seems clear on the best 
method of approaching data analysis and interpretation. 
Cross-referencing results from multiple 
experiments (time, dose, repeats, different 
animals, different species) to identify commonly 
expressed genes is a great challenge. In most 
cases, we are still a long way from understanding 
how the expression of gene X is related to 
the expression of gene Y, and ordering gene 
expression to delineate causal relationships. 

To the ordinary scientist in the typical laboratory, 
however, the most immediate problem 
is a lack of affordable instrumentation. 
One can purchase premade membranes at 
relatively affordable prices. Although these 
may be useful in identifying individual genes 
to pursue in more detail using other methods, 
the numbers that would be required for even a 
small routine toxicology experiment prohibit 
this as a truly viable approach. For the toxicologist, 
there is a need to carry out multiple 
experiments: dose responses, time curves, 
multiple animals, and repeats. Glass-based 
DNA arrays are most attractive in this context 
because they can be prepared in large batches 
from the same DNA source and accommodate 
control and treated samples on the same 
chip. Another problem with current off-the-shelf 
arrays is that they often do not contain 
one or more of the particular genes a group is 
interested in. One alternative is to obtain 
or produce a set of custom clones and have 
contract printing of membranes or slides carried 
out by a company such as Genomic 
Solutions, Inc. (Ann Arbor, MI). This approach 
is less expensive than laying out capital for 
one's own entire system, although at some 
point it might make economic sense to print 
one's own arrays. 

Finally, DNA arrays are currently a team 
effort. They are a technology that uses a wide 
range of skills including engineering, statistics, 
molecular biology, chemistry, and bioinformatics. 
Because most individuals are skilled in 
only one or perhaps two of these areas, it 
appears that success with arrays may be best 
expected by teams of collaborators consisting 
of individuals having each of these skills. 

Those considering array applications may 
be amused or goaded on by the following 
quote from Fortune magazine (12): 

Microprocessors have reshaped our economy, 
spawned vast fortunes and changed the way we live. 
Gene chips could be even bigger. 

Although this comment may have been 
designed to excite the imagination rather than 
accurately reflect the truth, it is fair to say that 
the age of functional genomics is upon us. 
DNA arrays look set to be an important tool in 
this new age of biotechnology and will likely 
contribute answers to some of toxicology's 
most fundamental questions. 

Subject: RE: [Fwd: Toxicology Chip] 
Date: Mon, 3 Jul 2000 08:09:45 -0400 
From: "Afshari.Cynthia" <afshari@niehs.nih.gov> 
To: "Diana Hamlet-Cox" <dianahc@incyte.com> 

You can see the list of clones that we have on our 12K chip at 
htt? : nanus 1 .r.iehs . mh . ccv inaps • guest • clcnesrr r. . c f r. 

We selected a subset of genes (2000K) that we believed critical to tox 
response and basic cellular processes and added a set of clones a[...] 
this. We have included a set of control genes (80+) that were selected by 
the NHGRI because they did not change across a large set of array 
experiments. However, we have found that some of these genes change 
significantly after tox treatments and are in the process of looking at the 
variation of each of these 80+ genes across our experiments. 

Our chips are constantly changing and being updated and we hope that our 
data will lead us to what the toxchip should really be. 

I hope this answers your question. 

Cindy Afshari 

> From: Diana Hamlet-Cox 
> Sent: Monday, June 26, 2000 8:52 PM 
> To: afshari@niehs.nih.gov 
> Subject: [Fwd: Toxicology Chip] 
> 
> Dear Dr. Afshari, 
> 
> Since I have not yet had a response from Bill Grigg, perhaps he was not 
> the right person to contact. 
> 
> Can you help me in this matter? I don't need to know the sequences 
> necessarily, but I would like very much to know what types of sequences 
> are being used, e.g., GPCRs (more specific?), ion channels, etc. 
> 
> Diana Hamlet-Cox 
> 
> Original Message 
> Subject: Toxicology Chip 
> Date: Mon, 19 Jun 2000 18:31:48 -0700 
> From: Diana Hamlet-Cox <dianahc@incyte.com> 
> Organization: Incyte Pharmaceuticals 
> To: grigg@niehs.nih.gov 
> 
> Dear Colleague: 
> 
> I am doing literature research on the use of expressed genes as 
> pharmacotoxicology markers, and found the Press Release dated February 
> 29, 2000 regarding the work of the NIEHS in this area. I would like to 
> know if there is a resource I can access (or you could provide) that 
> would give me a list of the 12,000 genes that are on your Human ToxChip 
> Microarray. In particular, I am interested in the criteria used to 
> select sequences for the ToxChip, including any control sequences 
> included in the microarray. 
> 
> Thank you for your assistance in this request. 
> 
> Diana Hamlet-Cox, Ph.D. 
> Incyte Genomics, Inc. 
> 
> -- 

07/31/2000 10.3U AM 



> This email message is for the sole use of the intended recipient(s) and
> may contain confidential and privileged information subject to
> attorney-client privilege. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of the
> original message.
ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation
tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships
between proteins whose sequence identities are >30%. For
more distantly related proteins, they do much less well; only
one-half of the relationships between proteins with 20-30%
identity are found. Because many homologs have low sequence
similarity, most distant relationships cannot be detected by
any pairwise comparison method; however, those which are
identified may be used with confidence.



Sequence database searching plays a role in virtually every 
branch of molecular biology and is crucial for interpreting the 
sequences issuing forth from genome projects. Given the 
method's central role, it is surprising that overall and relative 
capabilities of different procedures are largely unknown. It is 
difficult to verify algorithms on sample data because this 
requires large data sets of proteins whose evolutionary rela- 
tionships are known unambiguously and independently of the 
methods being evaluated. However, nearly all known ho- 
mologs have been identified by sequence analysis (the method 
to be tested). Also, it is generally very difficult to know, in the 
absence of structural data, whether two proteins that lack clear 
sequence similarity are unrelated. This has meant that al- 
though previous evaluations have helped improve sequence 
comparison, they have suffered from insufficient, imperfectly 
characterized, or artificial test data. Assessment also has been 
problematic because high quality database sequence searching 
attempts to have both sensitivity (detection of homologs) and 
specificity (rejection of unrelated proteins); however, these 
complementary goals are linked such that increasing one 
causes the other to be reduced. 
Sequence comparison methodologies have evolved rapidly,
so no previously published test has evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release
(version 3.0) provides fundamentally different results in the
form of statistical scoring.

The previous reports also have left gaps in our knowledge. 
For example, there has been no published assessment of 
thresholds for scoring schemes more sophisticated than per- 
centage identity. Thus, the widely discussed statistical scoring 
measures have never actually been evaluated on large data- 
bases of real proteins. Moreover, the different scoring schemes 
commonly in use have not been compared. 

Beyond these issues, there is a more fundamental question: 
in an absolute sense, how well does pairwise sequence com- 
parison work? That is, what fraction of homologous proteins 
can be detected using modern database searching methods? 

In this work, we attempt to answer these questions and to 
overcome both of the fundamental difficulties that have hin- 
dered assessment of sequence comparison methodologies. 
First, we use the set of distant evolutionary relationships in the 
SCOP: Structural Classification of Proteins database (4), which 
is derived from structural and functional characteristics (5). 
The SCOP database provides a uniquely reliable set of ho- 
mologs, which are known independently of sequence compar- 
ison. Second, we use an assessment method that jointly mea- 
sures both sensitivity and specificity. This method allows 
straightforward comparison of different sequence searching 
procedures. Further, it can be used to aid interpretation of real 
database searches and thus provide optimal and reliable 
results. 

Previous Assessments of Sequence Comparison. Several 
previous studies have examined the relative performance of 
different sequence comparison methods. The most encom- 
passing analyses have been by Pearson (6, 7), who compared 
the three most commonly used programs. Of these, the Smith- 
Waterman algorithm (8) implemented in SSEARCH (3) is the 
oldest and slowest but the most rigorous. Modern heuristics 
have provided blast (1) the speed and convenience to make 
it the most popular program. Intermediate between these two 
is FASTA (3), which may be run in two modes offering either 
greater speed (ktup = 2) or greater effectiveness (ktup = 1). 
Pearson also considered different parameters for each of these 
programs. 

To test the methods, Pearson selected two representative 
proteins from each of 67 protein superfamilies defined by the 
pir database (9). Each was used as a query to search the 
database, and the matched proteins were marked as being 
homologous or unrelated according to their membership of pir 
Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors corresponds to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.
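The HSSP-adjusted score described in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the caption's "H > 100 for l < 10" is represented here as an unreachable threshold (infinity), and the treatment of the boundary lengths 10 and 80, which the caption leaves unspecified, is an assumption.

```python
import math


def hssp_threshold(length: int) -> float:
    """Length-dependent percent-identity threshold H from the caption.

    Assumptions (not stated in the source): alignments shorter than 10
    residues get an unreachable threshold, and the boundary lengths 10
    and 80 are evaluated with the power-law formula.
    """
    if length < 1:
        raise ValueError("alignment length must be positive")
    if length < 10:
        return math.inf  # caption: H > 100, so no identity can exceed it
    if length > 80:
        return 24.7      # caption: constant threshold for long alignments
    return 290.15 * length ** -0.562


def hssp_adjusted_identity(percent_identity: float, length: int) -> float:
    """Percent identity within the alignment minus H."""
    return percent_identity - hssp_threshold(length)
```

For example, an alignment with 39% identity over 64 residues has a threshold of roughly 28%, so its adjusted score is roughly 11 percentage points.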



perfect separation, with all of the homologs at the top of the 
list and unrelated proteins below. In practice, perfect separa- 
tion is impossible to achieve so instead one is interested in 
drawing a threshold above which there are the largest number 
of related pairs of sequences consistent with an acceptable 
error rate. 

Our procedure involved measuring the coverage and error 
for every threshold. Coverage was defined as the fraction of 
structurally determined homologs that have scores above the 
selected threshold; this reflects the sensitivity of a method. 
Errors per query (EPQ), an indicator of selectivity, is the 
number of nonhomologous pairs above the threshold divided 
by the number of queries. Graphs of these data, called 
coverage vs. error plots, were devised to understand how 



protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of Receiver
Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of nonhomologs.
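The coverage and EPQ definitions above can be illustrated with a short sketch. It assumes each compared pair is given as a (score, is_homolog) tuple in which a higher score means a more significant match (E-values would need to be negated first), treats a pair as passing when its score is at or above the threshold, and does not group tied scores; the function names are hypothetical.

```python
from typing import List, Tuple


def coverage_vs_error(
    pairs: List[Tuple[float, bool]], n_homologs: int, n_queries: int
) -> List[Tuple[float, float, float]]:
    """Return (threshold, coverage, errors per query) for every score.

    coverage = homologous pairs at or above the threshold / all homologs
    EPQ      = nonhomologous pairs at or above the threshold / queries
    """
    ordered = sorted(pairs, key=lambda pair: pair[0], reverse=True)
    curve = []
    true_hits = false_hits = 0
    for score, is_homolog in ordered:
        if is_homolog:
            true_hits += 1
        else:
            false_hits += 1
        curve.append((score, true_hits / n_homologs, false_hits / n_queries))
    return curve


def coverage_at_epq(curve: List[Tuple[float, float, float]], max_epq: float) -> float:
    """Largest coverage reachable without exceeding the given EPQ."""
    allowed = [coverage for _, coverage, epq in curve if epq <= max_epq]
    return max(allowed, default=0.0)


# Toy data: five scored pairs, four true homologs in total, two queries.
toy_pairs = [(90, True), (80, True), (70, False), (60, True), (50, False)]
toy_curve = coverage_vs_error(toy_pairs, n_homologs=4, n_queries=2)
```

Plotting coverage on the x axis against EPQ on a logarithmic y axis would give a plot of the kind described in the text.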

This assessment procedure is directly relevant to practical 
sequence database searching, for it provides precisely the 
information necessary to perform a reliable sequence database 
search. The EPQ measure places a premium on score consis- 
tency; that is, it requires scores to be comparable for different 
queries. Consistency is an aspect which has been largely 
Fig. 2. Unrelated proteins with high percentage identity. Hemoglobin
β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).
Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB40D-B. Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).
likely, its power can be attributed to its incorporation of more
information than any other measure; it takes account of the 
full substitution and gap data (like raw scores) but also has 
details about the sequence lengths and composition and is 
scaled appropriately. 

We find that statistical scores are not only powerful, but also 
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per 
query (Fig. 4). The expectation value score gives a good, 
slightly conservative estimate of the chances of the two se- 
quences being found at random in a given query. Thus, an 
E-value of 0.01 indicates that roughly one pair of nonhomologs 
of this similarity should be found in every 100 different queries. 
Neither raw scores nor percentage identity can be interpreted 
in this way, and these results validate the suitability of the 
extreme value distribution for describing the scores from a 
database search. 
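The interpretation of the E-value given above amounts to simple arithmetic, sketched below under the assumption that the statistical score is well calibrated (as the text reports for SSEARCH and FASTA). The function name is hypothetical.

```python
def expected_false_positives(e_value: float, n_queries: int) -> float:
    """Expected number of nonhomologous pairs scoring at least this well.

    Assumes a calibrated E-value, so that an E-value of E corresponds
    to about E errors per query.
    """
    return e_value * n_queries


# An E-value of 0.01 over 100 queries: about one false positive.
per_hundred_queries = expected_false_positives(0.01, 100)

# The same threshold over the 1,323 queries of the PDB40D-B all-vs.-all
# comparison: about 13 errors, i.e. 1% EPQ.
all_vs_all = expected_false_positives(0.01, 1323)
```

Raw scores and percentage identities carry no such direct reading, which is the point the paragraph makes.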

The P-values from blast also should be directly interpret- 
able but were found to overstate significance by more than two 
orders of magnitude for 1% EPQ for this database. Nonethe- 
less, these results strongly suggest that the analytic theory is 
fundamentally appropriate. WU-BLAST2 scores were more re- 
liable than those from blast, but also exaggerate expected 
confidence by more than an order of magnitude at 1% EPQ. 

Overall Detection of Homologs and Comparison of Algo- 
rithms. The results in Fig. 5A and Table 1 show that pairwise 
sequence comparison is capable of identifying only a small 
fraction of the homologous pairs of sequences in PDB40D-B. 
Even SSEARCH with E-values, the best protocol tested, could 
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect homologs.
Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such relationships
have no more sequence identity than would be expected
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have signif- 
icant E-values, but 26 of these involve sequences with <50 
residues. Of sequences having 25-30% identity, 75% are 
identified by ssearch E-values. However, although the num- 
ber of homologs grows at lower levels of identity, the detection 
falls off sharply: only 40% of homologs with 20-25% identity 
Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged
extremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

are detected and only 10% of those with 15-20% can be found. 
These results show that statistical scores can find related 
proteins whose identity is remarkably low; however, the power 
of the method is restricted by the great divergence of many 
protein sequences. 

After completion of this work, a new version of pairwise 
blast was released: blastgp (37). It supports gapped align- 
ments, like WU-BLAST2, and dispenses with sum statistics. Our 
initial tests on blastgp using default parameters show that its 
E-values are reliable and that its overall detection of homologs 
was substantially better than that of ungapped blast, but not 
quite equal to that of WU-BLAST2. 



CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27 
and references therein) suggests that the most effective se- 
quence searches are made by (i) using a large current database 
in which the protein sequences have been complexity masked 
and (ii) using statistical scores to interpret the results. Our 
experiments fully support this view. 

Our results also suggest two further points. First, the E-val- 
ues reported by fasta and ssearch give fairly accurate 
estimates of the significance of each match, but the P-values 
provided by blast and wu-blast2 underestimate the true 



Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                  Relative time*   1% EPQ cutoff      Coverage at 1% EPQ (%)
SSEARCH % identity: within alignment    25.5             >70%               <0.1
SSEARCH % identity: within both         25.5             34%                3.0
SSEARCH % identity: HSSP-scaled         25.5             35% (HSSP + 9.8)   4.0
SSEARCH Smith-Waterman raw scores       25.5             142                10.5
SSEARCH E-values                        25.5             0.03               18.4
FASTA ktup = 1 E-values                 3.9              0.03               17.9
FASTA ktup = 2 E-values                 1.4              0.03               16.7
WU-BLAST2 P-values                      1.1              0.003              17.5
BLAST P-values                          1.0              0.00016            14.8

*Times are from large database searches with genome proteins.
Proteomics is a new enabling technology that is being
integrated into the drug discovery process. This will 
facilitate the systematic analysis of proteins across any 
biological system or disease, forwarding new targets 
and information on mode of action, toxicology and sur- 
rogate markers. Proteomics is highly complementary to 
genomic approaches in the drug discovery process and, 
for the first time, offers scientists the ability to integrate 
information from the genome, expressed mRNAs, their 
respective proteins and subcellular localization. It is ex- 
pected that this will lead to important new insights into 
disease mechanisms and improved drug discovery 
strategies to produce novel therapeutics.

Among the major pharmaceutical and biotechnol- 
ogy companies, it is clearly recognized that the 
business of modern drug discovery is a highly 
competitive process. All of the many steps in- 
volved are inherently complex, and each can involve a 
high risk of attrition. The players in this business strive 
continuously to optimize and streamline the process; each 
seeking to gain an advantage at every step by attempting 
to make informed decisions at the earliest stage possible. 
The desired outcome is to accelerate as many key activities 
in the drug discovery process as possible. This should pro- 



duce a new generation of robust drugs that offer a high 
probability of success and reach the clinic and market 
ahead of the competition. 

There has been noticeable emphasis over recent years 
for companies to aggressively review and refine their 
strategies to discover new drugs. Central to this has been 
the introduction and implementation of cutting-edge 
technologies. Most, if not all, companies have now inte- 
grated key technology platforms that incorporate gen- 
omics, mRNA expression analysis, relational databases, 
high-throughput robotics, combinatorial chemistry and 
powerful bioinformatics. Although it is still early days to 
quantify the real impact of these platforms in clinical and 
commercial terms, expectations are high, and it is widely 
accepted that significant benefits will be forthcoming. This 
is largely based on data obtained during preclinical studies 
where the genomic (Refs 1,2) and microarray (Refs 3,4) technologies have
already proved their value. 

However, there are several noteworthy outcomes that re- 
sult from this. Many comments are voiced that scientists 
armed with these technologies are now commonly faced 
with data overload. Thus, in some instances, rather than 
facilitating the decision process, the accumulation of more 
complex data points, many with unknown consequences, 
can seem to hinder the process. Also, most drug compa- 
nies have simultaneously incorporated very similar compo- 
nents of the new technology platforms, the consequence 
being that it is becoming difficult yet again to determine 
where a clear competitive advantage will arise. Finally, in 
recent years, largely as a result of the accessibility of the 
technologies, there has been an overwhelming emphasis 
placed on genomic and mRNA data rather than on protein 



Martin J. Page*, Bob Amess, Christian Rohlff, Colin Stubberfield and Raj Parekh, Oxford GlycoSciences, 10 The Quadrant,
Abingdon Science Park, Abingdon, Oxfordshire, UK OX14 3YS. *tel: +44 1235 543277, fax: +44 1235 543283,
e-mail: martin.page@ogs.co.uk 



DDT Vol. 4, No. 2 February 1999 1359-6446/99/$ - see front matter © Elsevier Science. All rights reserved. PII: S1359-6446(98)01291-4






research focus 



[Figure 1 workflow: Sample; 2D gels and imaging; Curation and interrogation; Differential analysis (Proteograph™); Mass spectrometry and annotation]

Figure 1. Steps involved in analysing a biological sample by proteomics. MCI, molecular cluster index.



analysis. It is important to remember that proteins dictate 
biological phenotype - whether it is normal or diseased - 
and are the direct targets for most drugs. 

Proteomics: new technology for 
the analysis of proteins 

It is now timely to recognize that complementary technol- 
ogy in the form of high-throughput analysis of the total 
protein repertoire of chosen biological samples, namely 
proteomics, is poised to add a new and important dimen- 
sion to drug discovery. In a similar fashion to genomics, 
which aims to profile every gene expressed in a cell, proteomics
seeks to profile every protein that is expressed (Refs 5-7).
However, there is added information, since proteomics can
also be used to identify the post-translational modifications
of proteins (Ref. 8), which can have profound effects on biological
function, and their cellular localization. Importantly,
proteomics is a technology that integrates the significant 
advances in two-dimensional (2D) electrophoretic separa- 
tion of proteins, mass spectrometry and bioinformatics. 
With these advances it is now possible to consistently de- 
rive proteomes that are highly reproducible and suitable 
for interrogation using advanced bioinformatic tools. 

There are many variations whereby different laboratories 
operate proteomics. For the purpose of this review, the 



process used at Oxford GlycoSciences (OGS), which uses 
an industrial-scale operation that is integral to its drug dis- 
covery work, will be described. The individual steps of 
this process, where up to 1000 2D gels can be run and 
analysed per week, are summarized in Fig. 1. The incom- 
ing samples are bar coded and all information relevant to 
the sample is logged into a Laboratory Information 
Management System (LIMS) database. There can be a wide 
range in the type of samples processed, as applicable to 
individual steps in the drug discovery pipeline, and these 
will be mentioned later. The samples are separated according
to their charge (pI) in the first dimension, using isoelectric
focusing, followed by size (MW) using SDS-PAGE
in the second dimension. Many modifications have been 
made to these steps to improve handling, throughput and 
reproducibility. The separated proteins are then stained 
with fluorescent dyes which are significantly more sensi- 
tive in detection than standard silver methods and have a 
broader dynamic range. The image of the displayed pro- 
teins obtained is referred to as the proteome, and is digi- 
tally scanned into databases using proprietary software 
called ROSETTA™. The images are subsequently curated, 
which begins with the removal of any artefacts, cropping 
and the placement of pI/MW landmarks. Replicate images
are then aligned and matched to one






another to generate a synthetic composite image. This is 
an important step, as the proteome is a dynamic situation, 
and it captures the biological variation that occurs, such 
that even orphan proteins are still incorporated into the 
analysis. 

By means of illustration, Fig. 1 shows the process 
whereby proteomes are generated from normal and dis- 
ease samples and how differentially expressed proteins are 
identified. The potential of this type of analysis is tremen- 
dous. For example, from a mammalian cell sample, in ex- 
cess of 2000 proteins can typically be resolved within the 
proteome. The quality of this is shown in Fig. 2, which 
shows representative proteomes from three diverse bio- 
logical sources: human serum, the pathogenic fungus 
Candida albicans and the human hepatoma cell line 
Huh7. 

Use of proteomics to identify 
disease specific proteins 

In most cases, the drug discovery process is initiated by 
the identification of a novel candidate target - almost al- 
ways a protein - that is believed to be instrumental in the 
disease process. To date, there is a variety of means 
whereby drug targets have been forthcoming. These in- 
clude molecular, cellular and genomic approaches, mostly 
centred upon DNA and mRNA analysis. The gene in ques- 
tion is isolated, and expression and characterization of its 
coded protein product - i.e. the drug target - is invariably 
a secondary event. 

With the proteomic approach, the starting point is at the 
other end of the 'telescope'. Here there is direct and im- 



mediate comparison of the proteomes from paired normal 
and disease materials. Examples of these pairs are: (1) pu- 
rified epithelial cell populations derived from human 
breast tumours, matched to purified normal populations of 
human breast epithelial cells, and (2) the invading patho- 
genic hyphal form of C. albicans, matched to the non- 
invading yeast form of C. albicans. When the proteome 
images from each pair are aligned, the Proteograph™ soft- 
ware is able to rapidly identify those proteins (each refer- 
enced as having a unique molecular cluster index, or MCI) 
that are either unique, or those that are differentially ex- 
pressed. Thus, the Proteograph output from this analysis is 
both qualitative and quantitative. 

Proteograph analysis for a particular study can also be 
undertaken on any number of samples. For example, one 
might compare anything from a few to several hundred 
preparations or samples, each from a normal and disease 
counterpart, and have these analysed in a single 
Proteograph study. In this way, it is possible to assign 
strong statistical confidence to the data and in some in- 
stances to identify specific subpopulations within the input 
biological sources. This feature will become increasingly 
significant in the near future, and there is a clear synergy 
here whereby proteomics can work closely with pharma- 
cogenomic approaches to stratify patient populations and 
achieve effective targeted care for the patient. Whatever 
the source of the materials, the net output of Proteograph 
analysis is immediate identification of disease specific pro- 
teins. This is shown in Fig. 3, which shows the results of 
a proteograph obtained by comparing untreated human 
hepatoma cells with cells following exposure to a clinical 
Figure 2. Representative proteomes obtained from (a) human serum, (b) the pathogenic fungus Candida albicans
and (c) the human hepatoma cell line Huh7. 
[Figure 3 legend. Foregrounds: Huh7 cells treated with 5FU. Backgrounds: Huh7 cells untreated. Bars mark proteins upregulated in Huh7 cells treated with 5FU with respect to untreated Huh7 cells, and proteins downregulated in Huh7 cells treated with 5FU.]
Figure 3. Table of differential protein expression
profiles, referred to as a Rosetta Proteograph™,
between Huh7 cells with and without the cytotoxic
agent 5-FU. Bars are quantized and do not represent
exact fold change values.



cytotoxic agent. In this instance, only the top 20 differentially
expressed MCIs are shown, but the readout would
normally extend to a defined cut-off value, typically a
two-fold or greater difference in expression levels, determined
by the user.
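As an illustration only, the cut-off described above can be sketched as a simple fold-change filter. This is not the Proteograph™ software, whose internals are not described here: the function name, the dictionary layout (MCI label mapped to a strictly positive spot intensity) and the example values are all hypothetical.

```python
from typing import Dict, List, Tuple


def differential_features(
    foreground: Dict[str, float],
    background: Dict[str, float],
    min_fold: float = 2.0,
) -> Tuple[List[Tuple[str, str, float]], List[Tuple[str, str]]]:
    """Split features into differentially expressed and unique ones.

    Intensities are assumed to be strictly positive. A feature seen in
    only one sample is reported as unique rather than given a fold value.
    """
    changed = []
    unique = []
    for mci in sorted(set(foreground) | set(background)):
        fg = foreground.get(mci)
        bg = background.get(mci)
        if fg is None or bg is None:
            label = "foreground only" if bg is None else "background only"
            unique.append((mci, label))
            continue
        fold = fg / bg
        if fold >= min_fold:
            changed.append((mci, "up", fold))
        elif fold <= 1.0 / min_fold:
            changed.append((mci, "down", 1.0 / fold))
    # Largest changes first, so a "top 20" readout is a simple slice.
    changed.sort(key=lambda item: item[2], reverse=True)
    return changed, unique


# Hypothetical intensities for treated (foreground) and untreated cells.
treated = {"MCI-1": 400.0, "MCI-2": 100.0, "MCI-3": 50.0, "MCI-4": 30.0}
untreated = {"MCI-1": 100.0, "MCI-2": 90.0, "MCI-3": 200.0, "MCI-5": 10.0}
changed, unique = differential_features(treated, untreated)
top_20 = changed[:20]
```

Raising `min_fold` narrows the readout to larger changes, mirroring the user-defined cut-off mentioned in the text.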

In a typical analysis involving disease and normal mam- 
malian material, in which each proteome would have 
~2000 protein features each assigned an MCI, the proteograph
might identify somewhere in the region of 50-300
MCIs that are unique or differentially expressed. To capi- 
talize rapidly on these data, at OGS a high-throughput 



mass spectrometry facility coupled to advanced databases 
to annotate these MCIs as individual proteins is applied. As 
these are all disease specific proteins, each could represent 
a novel target and/or a novel disease marker. The process 
becomes even more powerful when a panel of features, 
rather than individual features, are assigned. The relevance 
of this is apparent when one considers that most diseases, 
if not all, are multifactorial in nature and arise from poly- 
genic changes. Rather than analysing events in isolation, 
the ability to examine hundreds or thousands of events 
simultaneously, as shown by proteomics, can offer real 
advantages. 

Identification and assignment of candidate targets 
The rapid identification and assignment of candidate tar- 
gets and markers represents a huge challenge, but this has 
been greatly facilitated by combining the recent advances 
made in proteomics and analytical mass spectrometry (Ref. 9).
Using automated procedures it is now possible to annotate 
proteins present in femtomole quantities, which would de- 
pict the low abundance class of proteins. The process of 
annotation is similarly aided by the quality and richness of 
the sequence specific databases that are currently avail- 
able, both in the public domain and in the private sector 
(e.g. those supplied by Incyte Pharmaceuticals). In this re- 
spect, the advances in proteomics have benefited consider- 
ably from the breakthroughs achieved with genomics. 

From an application perspective, cancer studies provide a 
good opportunity whereby proteomics can be instrumental 
in identifying disease specific proteins, because it is often 
feasible to obtain normal and diseased tissue from the same 
patient. For example, proteomic studies have been reported
on neuroblastomas (Ref. 10), human breast proteins from
normal and tumour sources (Refs 11-13), lung tumours (Ref. 14), colon
tumours (Ref. 15) and bladder tumours (Ref. 16). There are also proteomic
studies reported within the cardiovascular therapeutic area,
in which disease or response proteins are identified (Refs 17,18).

Genomic microarray analysis can similarly identify 
unique species or clusters of mRNAs that are disease spe- 
cific. However, in some instances, there is a clear lack of 
correlation between the levels of a specific mRNA and its 
corresponding protein (Ref. 19; Gygi, S.P. et al., submitted).
This has now been noted by many investigators and
reaffirms that post-transcriptional events, including protein 
stability, protein modification (such as phosphorylation, 
glycosylation, acylation and methylation) and cell localiz- 
ation, can constitute major regulatory steps. Proteomic 
analysis captures all of these steps and can therefore pro- 
vide unique and valuable information independent from, 
or complementary to, genomic data. 

Proteomics for target validation and signal transduction studies

The identification of disease specific proteins alone is in- 
sufficient to begin a drug screening process. It is critical to 
assign function and validation to these proteins by con- 
firming they are indeed pivotal in the disease process. 
These studies need to encompass both gain- and loss-of- 
function analyses. This would determine whether the activity 
of a candidate target (an enzyme, for example), eliminated 
by molecular/cellular techniques, could reverse a disease 
phenotype. If this happened, then the investigator would 
have increased confidence that a small-molecule inhibitor 
against the target would also have a similar effect. The 
proposal of candidate drug targets is often not a difficult 
process, but validating them is another matter. Validation 
represents a major bottleneck where the wrong decision 
can have serious consequences 20 . 

Proteomics can be used to evaluate the role of a chosen 
target protein in signal transduction cascades directly rel- 
evant to the disease. In this manner, valuable information 
is forthcoming on the signalling pathways that are per- 
turbed by a target protein and how they might be cor- 
rected by appropriate therapeutics. Techniques that are 
well established in one-dimensional protein studies to in- 
vestigate signalling pathways, such as western blotting 
and immunoprecipitation, are highly suited to proteomic 
applications. For example, the proteomes obtained can be 
blotted onto membranes and probed with antibodies 
against the target protein or related signalling molecules 21-23.
Because proteomics can resolve >2000 proteins on a single gel,
it is possible to derive important
information on specific isoforms (such as glycosylated or 
phosphorylated variants) of signalling molecules. This will 
result in characterization of how they are altered in the 
disease process. Western immunoblotting techniques 
using high-affinity antibodies will typically identify proteins
present at ~10 copies per cell (~1.7 fmol); this is in
contrast to the best fluorescent dyes currently available 
that are limited to imaging proteins at 1000 or more 
copies per cell. The level of sensitivity derived by these 
applications will greatly facilitate interpretation of com- 
plex signalling pathways and contribute significantly to 
validation of the target under study. 
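
As a rough check on the sensitivity figures quoted above, a femtomole amount can be converted to a number of molecules through Avogadro's number and then divided by the number of cells in the sample. The sketch below is illustrative only: the cell count of 1e8 is an assumption chosen here so that the arithmetic can be followed, not a value given in the text, and the function names are hypothetical.

```python
# Illustrative arithmetic only; the cell count used below is an assumption.
AVOGADRO = 6.02214076e23  # molecules per mole


def fmol_to_molecules(fmol: float) -> float:
    """Convert an amount in femtomoles to a number of molecules."""
    return fmol * 1e-15 * AVOGADRO


def copies_per_cell(fmol: float, n_cells: float) -> float:
    """Average copies per cell for a protein amount recovered from n_cells cells."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return fmol_to_molecules(fmol) / n_cells


if __name__ == "__main__":
    amount_fmol = 1.7        # amount quoted in the text
    assumed_cells = 1e8      # assumption: number of cells in the lysate
    print(f"{fmol_to_molecules(amount_fmol):.3e} molecules")
    print(f"{copies_per_cell(amount_fmol, assumed_cells):.1f} copies per cell")
```

Under that assumed cell number, 1.7 fmol corresponds to roughly 10 copies per cell; a smaller sample would imply proportionally more copies per cell for the same detected amount.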

Immunoprecipitation studies 

Similarly, immunoprecipitation studies are another useful 
way to exploit the resolving power of proteomics 2125 . In 
this instance, very large quantities of protein (e.g. several 
milligrams) can be subjected to incubation with antibodies 
against chosen signalling molecules. This allows high-affinity
capture of these proteins, which can subsequently be
eluted and electrophoresed on a 2D gel to provide a high- 
resolution proteome of a specific subset of proteins. 
Detection by blot analysis allows the identification of ex- 
tremely small amounts of defined signalling molecules. 
Again, the different isoforms of even very low abundance 
proteins can be seen, and, very importantly, the technique 
allows the investigator to identify multiprotein complexes 
or other proteins that co-precipitate with the target protein. 
These coassociating proteins frequently represent sig- 
nalling partners for the target protein, and their identifi- 
cation by mass spectrometry can lead to invaluable infor- 
mation on the signalling processes involved. 

The depth of signal transduction analysis offered by 
proteomics, and the utility for target validation studies, 
can be extended even further by applying cell fractionation
studies 26-28. By purifying subcellular fractions, such
as membrane, nuclear, organelle and cytosolic, it is possi- 
ble to assign a localization to proteins of interest and to 
follow their trafficking in a cell. Enrichment of these frac- 
tions will also allow much higher representation of low 
abundance proteins on the proteome. Their detection by 
fluorescent dyes or immunoblot techniques will lead to 
the identification of proteins in the range of 1-10 copies 
per cell, putting the sensitivity on a par with genomic 
approaches. 

These signal transduction analyses can be of additional 
value in experiments where inhibitors derived from a 
screening programme against the target are being evalu- 
ated for their potency and selectivity. The inhibitors can 
encompass small molecules, antisense nucleic acid con- 
structs, dominant-negative proteins, or neutralizing anti- 
bodies microinjected into cells. In each case, proteome 
analysis can provide unique data in support of validation 
studies for a chosen candidate drug target. 

Proteomics and drug mode-of-action studies 

Once a validated target is committed to a screening regi- 
men to identify and advance a lead molecule, it is impor- 
tant to confirm that the efficacy of the inhibitor is through 
the expected mechanism. Such mode-of-action studies are 
usually tackled by various cell biological and biochemical 
methods. Proteomics can also be usefully applied to these 
studies and this is illustrated below by describing data ob- 
tained with OGT719. This is a novel galactosyl derivative of 
the cytotoxic agent 5-fluorouracil (5-FU), which is currently 
being developed by OGS for the treatment of hepatocel- 
lular carcinoma and colorectal metastases localized 
in the liver. The premise underpinning the design and rationale
of OGT719 was to derive a 5-FU prodrug capable of targeting,
and being retained in, cells bearing the asialoglycoprotein
receptor (ASGP-r), including hepatocytes 29, hepatoma Huh7
cells 30 and some colorectal tumour cells 31. The growth of the
human hepatoma cell line Huh7 is inhibited by 5-FU or by
OGT719. If the inhibition by OGT719 were the result of uptake
and conversion to 5-FU as the active component, then it would
be expected that Huh7 cells would show similar proteome
profiles following exposure to either drug.

Figure 4. Features that are specifically up- or downregulated in Huh7 cells by either 5-fluorouracil (5-FU) or
OGT719: (a) elongation factor 1α2, (b) novel (three peptides by MS-MS) and (c) α-subunit of prolyl-4-hydroxylase.
Arrows indicate up- or downregulation.

To examine these possibilities, we conducted an experi- 
ment taking samples of Huh7 cells that had been treated 
with IC 50 doses of either OGT719 or 5-FU. Total cell lysates 
were prepared and taken through 2D electrophoresis, 
fluorescence staining, digital imaging and Proteograph 
analysis. To facilitate the interpretation of the data across 
all of the 2291 features seen on the proteomes, drug- 
induced protein changes of fivefold or greater, identified 
by the Proteograph, were analysed further. Interestingly, 
from this analysis 19 identical proteins were changed five- 
fold or more by both drugs, strongly suggesting similarities 
in the mode of action for these two compounds. 
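
The selection step described above (keep only features changed fivefold or more, then ask which features are changed by both drugs) can be sketched in a few lines. This is a minimal illustration, not the Proteograph analysis itself: the feature identifiers and ratios are invented, the function names are hypothetical, and each ratio is assumed to be treated abundance divided by control abundance. The text does not say whether the shared changes had to be in the same direction, so the sketch does not check direction.

```python
# Hypothetical sketch of a fold-change filter; all data below are invented.

def changed_features(fold_changes, threshold=5.0):
    """Return IDs of features that rose or fell by at least `threshold`-fold.

    `fold_changes` maps a feature ID to treated/control abundance ratio.
    """
    selected = set()
    for feature, ratio in fold_changes.items():
        if ratio <= 0:
            continue  # ratio undefined (feature absent in one sample); skipped here
        if ratio >= threshold or ratio <= 1.0 / threshold:
            selected.add(feature)
    return selected


def co_modulated(drug_a, drug_b, threshold=5.0):
    """Features passing the threshold under both treatments (direction not checked)."""
    return changed_features(drug_a, threshold) & changed_features(drug_b, threshold)


if __name__ == "__main__":
    ratios_5fu = {"f001": 6.2, "f002": 0.15, "f003": 1.3, "f004": 8.0}
    ratios_ogt719 = {"f001": 7.9, "f002": 0.9, "f003": 0.1, "f004": 0.12}
    print(sorted(co_modulated(ratios_5fu, ratios_ogt719)))
```

An empty intersection, as reported later for the HSN cells, would simply appear as an empty set.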

Thus, from very complex data involving >2000 protein 
features, using proteomics it is possible to analyse quanti- 
tatively and qualitatively each protein during its exposure 
to drugs. The biologist is now able to focus a series of fur- 
ther studies specifically on an enriched subset of proteins. 



Figure 4 shows highlighted examples of the selected areas 
of the proteome where some of these identified proteins in 
the above study are altered in response to either or both 
drugs. 

Several of the proteins identified above as being modu- 
lated similarly by 5-FU or OGT719 in Huh7 cells were sub- 
jected to tandem mass-spectrometric analysis for anno- 
tation. Some of these, such as the nuclear ribosomal 
RNA-binding protein 32 , can be placed into pyrimidine 
pathways or related cell cycle/growth biochemical path- 
ways in which 5-FU is known to act. 

To attribute further significance to the proteome mode- 
of-action studies with OGT719, another cell line, the rat 
sarcoma HSN, was used. Growth of these cells is inhibited 
by 5-FU, but they are completely refractory to OGT719; 
notably they lack the ASGP-r, which might explain this 
finding (unpublished). For our proteome studies, HSN 
cells were treated with 5-FU or OGT719 over a time course 
of one, two and four days. At each time point, cells were 
harvested and processed to derive proteomes and 
Proteographs. As before, we purposely focused on those 
proteins that increased or decreased by fivefold or more. 
In this instance, there were no proteins co-modulated by 
the two drugs. This is perhaps to be expected, given that 
the HSN cells are killed by 5-FU and yet are refractory to 
OGT719. 

Clear potential

The above is just an example of how proteomics can be 
used to address the mode of action of anticancer drugs. 
The potential of this approach is clear, and one can envis- 
age situations where it will be profitable to compare the 
proteomes of cells in which the drug target has been elimi- 
nated by molecular knockout techniques, or with small- 
molecule inhibitors believed to act specifically on the same 
target. In addition to using proteomics to examine the action
of drugs, it is also possible to use this approach to
gauge the extent of nonspecific effects that might eventu- 
ally lead to toxicity. For instance, in the example used 
above with HSN cells treated with OGT719, although cell 
growth was not affected, the levels of several specific pro- 
teins were changed. Further investigation of these proteins 
and the signalling pathways in which they are involved 
could be illuminating in predicting the likelihood or other- 
wise of long-term toxicity. 

Use of proteomics in formal drug 
toxicology studies 

A drug discovery programme at the stage where leads 
have been identified and mode-of-action studies are ad- 
vanced, will proceed to investigate the pharmacokinetic 
and toxicology profile of those agents. These two param- 
eters are of major importance in the drug discovery 
process, and many agents that have looked highly promis- 
ing from in vitro studies have subsequently failed because 
of insurmountable pharmacokinetic and/or toxicity prob- 
lems in vivo. Whereas the pharmacokinetic properties of a 
molecule can now be characterized quickly and accu- 
rately, toxicity studies are typically much longer and more 
demanding in their interpretation. 

The ability to achieve fast and accurate predictions of 
toxicity within an in vivo setting would represent a big
step forward in accelerating any drug discovery pro- 
gramme. Toxicity from a drug can be manifested in any 
organ. However, because the liver and kidney are the 
major sites in the body responsible for metabolism and 
elimination of most drugs, it is informative to examine 
these particular organs in detail to provide early indi- 
cations about events that might result in toxicity. 

The basis for most xenobiotic metabolizing activity is to 
increase the hydrophilicity of the compound and so facilitate
its removal from the body. Most drugs are metabolized in the
liver via the cytochrome P450 family of enzymes, which are
known to comprise a total of ~200 different members 33,34,
encompassing a wide array of overlapping specificities for
different substrates. In addition to clearance, they also play
a major role in metabolism that can lead to the production
and removal of toxic
species, and in some instances it is possible to correlate 
the ability or failure to remove such a toxin with a specific 
P450 or subgroup. 

Unique P450 profiles 

Each individual person will have a slightly different P450 
profile, largely from polymorphisms and changes in expression
levels, although other genetic and environmental factors aside
from P450 also need to be taken into consideration. A
significant amount of research is currently being directed
towards this field - known as pharmacogenomics - with the aim
of predicting how a patient will respond to a drug, as
determined by their genetic make-up 35-37. The marked
variation of individuals in their ability
to clear a compound can be one of the key factors in de- 
ciding the overall pharmacokinetic profile of a drug. Not 
only will this have a bearing on the likelihood of a patient 
responding to a treatment, but it will also be a factor in 
determining the possibility of their experiencing an ad- 
verse effect. 

Many pharmaceutical companies are already employing 
genomic approaches, involving P450 measurements, as a 
key step in their assessment of the toxicological profile of 
a candidate drug and therefore of its suitability, or other- 
wise, to be considered for human clinical trials. There are 
limits to this approach, however. Whereas the P450 mRNA 
profiling can predict with some accuracy the likely meta- 
bolic fate of a drug, it will not provide information on 
whether the metabolites would subsequently lead to tox- 
icity. Besides the patient-to-patient differences in steady- 
state levels of the P450s, there are also characteristic induc- 
tion responses of these enzymes to some drugs. Moreover, 
as there can be some doubt over the correlation of mRNA 
levels and the corresponding protein levels, there is scope 
for misinterpretation of the results and hence real advan- 
tages to be gained from a proteome approach. In both in- 
stances, the ability to examine entire proteome profiles, in- 
cluding the P450 proteins, will be a significant advantage 
in understanding and predicting the metabolism and 
toxicological outcome of drugs. 

In addition to direct organ and tissue studies, the serum, 
which collects the majority of toxicity markers released 
from susceptible organs and tissues throughout the entire 
body, can be utilized. Serum is rich in nuclease activity 
and, as pharmacogenomics is not suited to deal with these 
samples, valuable markers of toxicity could go undetected. 
However, by using proteomics for these types of analyses, 
serum markers (and clusters thereof) are now accessible 
for evaluation as indicators of toxicity. 

Pharmacoproteomics

Proteomics can thus be used to add a new sphere of 
analysis to the study of toxicity at the protein level, and in 
the era of '-omics' there is a case to be made to adopt the 
term 'Pharmacoproteomics™'. Animals can be dosed with 
increasing levels of an experimental drug over time, and 
serum samples can be drawn for consecutive proteome 
analyses. Using this procedure, it should be possible to 
identify individual markers, or clusters thereof, that are 
dose related and correlate with the emergence and severity 
of toxicity. Markers might appear in the serum at a defined 
drug dose and time that are predictive of early toxicity 
within certain organs and if allowed to continue will have 
damaging consequences. These serum markers could sub- 
sequently be used to predict the response of each individ- 
ual and allow tailoring of therapy whereby optimal effi- 
cacy is achieved without adverse side effects being 
apparent. This application can obviously extend to track- 
ing toxicity of drugs in clinical trials where serum can be 
readily drawn and analysed. Surrogate markers for drug ef- 
ficacy could also be detected by this procedure and could 
facilitate the challenge of identifying patient classes who 
will respond favourably to a drug and at what dosage. 

Conclusions 

By contrast to the agents administered to patients in clini- 
cal wards, the process of drug discovery is not a prescrip- 
tive series of steps. The risks are high and there are long 
timelines to be endured before it is known whether a can- 
didate drug will succeed or fail. At each step of the drug 
discovery process there is often scope for flexibility in in- 
terpretation, which over many steps is cumulative. The 
pharmaceutical companies most likely to succeed in this 
environment are those that are able to make informed 
accurate decisions within an accelerated process. 

The genomics revolution has impacted very positively 
upon these issues and now has a powerful new partner in 
proteomics. The ability to undertake global analysis of pro- 
teins from a very wide diversity of biological systems and 
to interrogate these in a high-throughput, systematic man- 
ner will add a significant new dimension to drug discov- 
ery. Each step of the process from target discovery to clini- 
cal trials is accessible to proteomics, often providing 
unique sets of data. Using the combination of genomics 
and proteomics, scientists can now see every dimension of 
their biological focus, from genes and mRNA to proteins and
their subcellular localization. This will greatly assist our
understanding of the fundamental mechanistic basis of 
human disease and allow new improved and speedier 
drug discovery strategies to be implemented. 

August 11, 1997, Monday

SECTION: Financial News 

DISTRIBUTION: TO BUSINESS AND MEDICAL EDITORS 
LENGTH: 478 words 

HEADLINE: Eli Lilly & Co. and Acacia Biosciences Enter Into Research Collaboration; 
First Corporate Agreement for Acacia's Genome Reporter Matrix(TM) 

DATELINE: RICHMOND, Calif., Aug. 11 

BODY: 

Acacia Biosciences and Eli Lilly and Company (Lilly) announced today the signing of a joint research collaboration 
to utilize Acacia's Genome Reporter Matrix(TM) (GRM) to aid in the selection and optimization of lead compounds. 
Under the collaboration, Acacia will provide chemical and biological profiles on a class of Lilly's compounds for an 
undisclosed fee. 

Acacia's GRM is an assay-based computer modeling system that uses yeast as a miniature ecosystem. The GRM 
can profile the extent, nature and quantity of any changes in gene expression. Because of the similarities between 
the yeast and human genome, the system serves as an excellent surrogate for the human body, mimicking the effects 
induced by a biologically active molecule. 

"Using yeast as a model organism for lead optimization makes a lot of sense given the high degree of homology with 
human metabolic pathways," said William Current of Lilly Research Laboratories. "Acacia's innovative GRM has 
the potential to provide enormous insight into the therapeutic impact of our compounds and make the drug discovery 
process more rational. It should substantially accelerate the development process." 

"This first agreement with a major pharmaceutical company is an important milestone in the development of 
Acacia," said Bruce Cohen, President and CEO of Acacia. "The deal is in line with our strategy of establishing 
alliances that will allow our collaborators to use genomic profiles to identify and optimize compounds within 
their existing portfolios. In the long run, this technology can be used to characterize large scale combinatorial 
libraries, predict side effects prior to clinical trials and resurrect drugs that have failed during clinical trials." 

The GRM incorporates two critical elements: chemical response profiles and genetic response profiles. The 
chemical response profiles measure the change in gene expression caused by potential therapeutics and then rank genes 
with altered expressions by degree of response. The genetic response profiles measure changes in gene expression 
caused by mutations in the genes encoding potential targets of pharmaceuticals; these genetic response profiles represent 
gold standards in drug discovery by defining the response profile expected for drugs with perfect selectivity and 
specificity. By comparing the two profiles, one can analyze a potential drug candidate's ability to mimic the action of 
a 'perfect' drug. 

Acacia Biosciences is a functional genomics company developing proprietary technologies to enhance the speed 
and efficacy of drug discovery and development. Acacia's Genome Reporter Matrix capitalizes on the latest advances 
in genomics and combinatorial chemistry to generate comprehensive profiles of drug candidates' in vivo activity. 
SOURCE Acacia Biosciences 

CONTACT: Bruce Cohen, President and CEO of Acacia Biosciences, 510-669-2330 ext. 103 or Media: Linda 
Seaton of Feinstein 

Pharmagene Raises More Capital for Research on Human Tissues

By Sophia Fox 

Pharmagene, the Royston, U.K.-based biopharmaceutical company
specialising in the use of human biomaterials for drug
discovery research, has raised a further £5 million from a
group of investors led by 3i and Abacus Nominees. The funding
will enable the company to expand both its human biomaterials
collection and its capabilities across a range of proprietary
platform technologies.

Gordon Baxter, Ph.D., Pharmagene's cofounder and chief
operating officer, claimed "by the end of this year Pharmagene
will have access to the largest collection of human RNAs and
proteins anywhere in the world, and a range of innovative, yet
robust technologies

SEE PHARMAGENE, P. 9



Perkin-Elmer Acquires PerSeptive to Expand
Its Capabilities in Gene-Based Drug Discovery



By John Sterling 

Perkin-Elmer's (PE; Norwalk, CT) decision last month to
acquire PerSeptive Biosystems (Framingham, MA) via a $360
million stock swap was designed to strengthen PE in terms of
broad capabilities in gene-based drug discovery. The company's
main goal is to develop new products to improve the
integration of genetic and protein research.

"This merger will enhance our position as an effective
provider of innovative, integrated platforms enabling our
customers to be more efficient and cost-effective in bringing
new pharmaceuticals to market," says Tony L. White, PE's
chairman, president and CEO. "The combination of our two
companies should bolster our presence in the life sciences,
[and it is our] belief that we must take bold action now to
lead the emerging era of molecular medicine with leading
positions in both genetic and protein analysis."

A driving force behind the 
merger is the vast amount of genet- 



FDA OKs Genzyme's Carticel 
Product for Damage to Knees 
Carticel, which was approved for the repair of clinically significant,
symptomatic cartilaginous defects of the femoral condyle (medial, lateral or
trochlear) caused by acute or repetitive trauma, employs a proprietary
process to grow autologous cartilage cells for implantation.



By Naomi Pfeiffer

The FDA has approved a knee- 
cartilage replacement product 
made by Genzyme Tissue 
Repair (Cambridge, MA), a track- 
ing-stock division of Genzyme 
Corp., for people with trauma- 
damaged knees. 

Carticel™ (autologous cultured
chondrocytes) is the first product to
be licensed under the FDA's pro-
SEE GENZYME, P. 6 
Perkin-Elmer acquired PerSeptive Biosystems for $360 million to
obtain new technologies in mass spectrometry, bioseparations and
purification for product development projects, spanning the range
from genomics to proteomics.



ic information about human dis- 
ease that is being accumulated by 
researchers and biotech companies
working in the area of genomics. It 
is becoming increasingly obvious 
that these data need to be comple- 
mented with technologies for 



studying proteins and protein networks, a field known as
proteomics (see GEN, September 1, 1997, p. n).

PE officials, who claim that 
MALDI-TOF (Matrix Assisted 

SEE ACQUISITION. P. 10 



Strategies for Target Validation 
Streamline Evaluation of Leads 



By Vicki Glaser

Acacia Biosciences (Richmond, CA) last month announced its
first agreement with a major pharmaceutical company, signing a
deal with Eli Lilly (Indianapolis, IN) to use Acacia's Genome
Reporter Matrix (GRM) to select and optimize some of Lilly's
lead compounds. Acacia's yeast-based system for profiling drug
activity is useful for evaluating the therapeutic potential of
lead compounds, and it also has a role in the identification
and validation of new drug targets.

"We're using the ecosystem of a 
cell to allow us to deduce the mech- 
anism of action and target for any 
chemical," explains Bruce Cohen,
president and CEO. "We screen for 
every target in a cell simultaneously...using transcription
as a readout for how a cell is adapting to any
perturbation," he says.

The GRM technology consists of 
two main databases: one is the 
genetic response profile, showing 
the effects of mutations in each 
individual yeast gene and compen- 
satory gene regulatory mecha- 
nisms; the other is the chemical 
response profile, which documents 
changes in gene expression in 
response to chemical compounds. 
Computational analysis and pattern 
matching between the genetic and 
chemical profiles yields informa- 
tion on the specificity, potency and 
side-effects risk of a drug lead. 
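
The pattern matching described above can be pictured as scoring a compound's chemical response profile against each genetic response profile and ranking the mutants by similarity. The sketch below is only an illustration of that idea: the article does not say which similarity measure Acacia uses, so Pearson correlation is an assumption made here, and the profile values, mutant names and function names are all invented.

```python
# Hypothetical sketch; Pearson correlation is an assumed similarity measure.
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no pattern to match
    return cov / (sd_x * sd_y)


def rank_targets(chemical_profile, genetic_profiles):
    """Rank mutants by similarity of their profile to the compound's profile."""
    scores = {name: pearson(chemical_profile, profile)
              for name, profile in genetic_profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented log-ratio expression changes across five reporter genes.
    compound = [2.0, -1.5, 0.1, 0.0, 1.2]
    mutants = {
        "mutant_A": [1.8, -1.2, 0.0, 0.2, 1.0],
        "mutant_B": [-0.5, 0.3, 1.9, -1.1, 0.0],
    }
    for name, score in rank_targets(compound, mutants):
        print(f"{name}: {score:+.2f}")
```

A compound whose profile correlates strongly with a single mutant and weakly with all others would, on this reading, look more specific than one that resembles many mutants at once.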

Targeting Targets 

No longer is mapping and 
sequencing a gene— or the human 
genome — an end unto itself, but 
SEE TARGET, P. 18 



Sticky Ends 

Avigen received two grants from the NIH & University of
California for research on gene therapy for treatment of
cancer & HIV infections ... KRL Pharmaceutical Services, of
Reston, VA, launched the TSN Bug Finder, which is able to
locate & retrieve client-specified microorganisms in
real-time ... Gensia Sicor, Inc. will move its corporate staff
from San Diego to Irvine, CA, by end of year ... FDA accepted
NDA from Sepracor for levalbuterol HCl inhalation solution ...
An $11.7M mezzanine financing has been closed by Activated
Cell Therapy, which changed its name to Dendreon
Corporation ... Astra AB will build major research facility in
Waltham, MA, and is also relocating Astra Arcus research
facility from Rochester to Boston area ... Prolifix Ltd. team
used a small peptide to inhibit the E2F protein complex and
induced apoptosis in mammalian tumor cells ... Vertex
Pharmaceuticals, Inc. and Alpha Therapeutic Corp. ended an
agreement to develop VX-366 for treatment of inherited
hemoglobin disorders ... NaviCyte received Phase I SBIR grant
for up to $100,000 from NIH for development of prototype of
its NaviFlow technology for high-throughput screening ...
Covance Inc. will invest $21 million in expansion and
renovation of its facility in Indianapolis, IN.
merely a means to an end. The critical
next step is to validate the gene
and its protein product as a potential 
drug target. The Human Genome 
Project continues to produce a trea- 
sure chest of expressed sequence 
tags (ESTs) and a tantalizing array of 
complete gene sequences. 

Companies are applying a variety 
of functional genomic strategies to 
link genes to specific diseases and to 
multigenic phenotypes. Yet the ulti- 
mate challenge for pharmaceutical 
companies is to sift through all the 
sequence and differential gene 
expression data to identify the best 
targets for drug discovery. 

Spinning off technology devel- 
oped at the University of North 
Carolina (Chapel Hill), Cytogen 
Corp. (Princeton, NJ) formed its 
wholly owned subsidiary AxCell 
Biosciences earlier this year. The 
young company is building a protein 
interaction database, cataloging all 
the interactions the modular domains 
of proteins can engage in with a range of ligands, in order
to gain insight into protein function and to select the most
critical interaction to target for drug development.

AxCell's cloning-of-ligand-targets (COLT) technology employs
"recognition units" from the company's genetic diversity
library (GDL) to map functional protein interactions and
quantitate their affinity. The company's inter-functional
proteomic database (IFP-dbase) elucidates protein interaction
networks and structure-activity relationships based on ligand
affinity with protein modular domains.

Defining Disease Pathways 

Signal Pharmaceuticals, Inc.'s (San Diego, CA) integrated drug
target and discovery effort is based on mapping gene-regulating
pathways in cells and identifying small molecules that regulate
the activation of those genes. In collaboration with academic
researchers, the company has identified a large number of
regulatory proteins in several mitogen-activated protein (MAP)
kinase pathways (including the JNK, ERK and p38 signaling
pathways), which Signal is evaluating for the treatment of
autoimmune, inflammatory, cardiovascular and neurologic
diseases, and cancer. Other target identification programs
focus on the NF-κB pathway, estrogen-related genes and
central/peripheral nervous system genes.

Regulating cytokine production in immune and inflammatory
disorders and modifying bone metabolism to treat osteoporosis
are the focus of Signal's collaboration with Tanabe Seiyaku
(Osaka, Japan). Signal has partnered with Organon/Akzo Nobel
(Netherlands) to identify estrogen-responsive genes as targets
for treating neurodegenerative and psychiatric diseases,
atherosclerosis and ischemia, and with Roche Bioscience (Palo
Alto, CA) to develop human peripheral nerve cell lines for the
discovery of treatments for pain and incontinence.

Exelixis' (S. San Francisco, CA)
strategy for target selection is to 
define disease pathways and identify 
regulatory molecules that activate or 
inhibit those biochemical/genetic 
pathways. Based on the finding that 
these pathways are conserved across 
species, the company is studying the 
model genetic systems of Drosophila 
and Caenorhabditis elegans. Using 
its Pathfinder technology, Exelixis 
systematically introduces mutations 
into the genomes of these model 
organisms, looking for mutations 
that enhance or suppress the target 
disease-related gene. These novel 
genes then become the basis of drug 
screening assays. 

Cadus Pharmaceutical Corp. 
(Tarrytown, NY) is identifying sur- 
rogate ligands to newly discovered 
orphan G-protein coupled transmembrane
receptors of unknown
function to determine the suitability 
of the receptors as drug targets. 
Inserting the novel receptor in a 
yeast system yields a ligand that 
activates the receptor. Access to a 
surrogate ligand allows the company 
to screen for receptor antagonists in 
the yeast system. 

"The antagonist plus the surrogate ligand gives you two
probes, an on probe and an off probe, which allows you to look
at function," explains David Webb, Ph.D., vp of research and
chief scientific officer. A surrogate ligand also provides
information on which G-protein interacts with the orphan
receptor and its associated signaling pathways, further
clarifying the role of the receptor as a potential drug
target. Cadus' collaboration with SmithKline (Philadelphia)
capitalizes on Cadus' ability to determine orphan receptor
function, applying the technology to SmithKline's proprietary,
newly discovered G-protein receptors.

Cadus' recombinant yeast system 
can also be used to screen cell and 
tissue extracts for natural ligands, 
and the company is accelerating its 
internal drug-discovery efforts in the 
areas of cancer, inflammation and 
allergy. A recent equity investment in 
Axiom Biotechnologies (San Diego,
CA) gave Cadus a license to Axiom's
high-throughput pharmacologic 
screening system for lead optimiza- 
tion and discovery. 

As its name implies,
gene/Networks (Alameda, CA) 
focuses on identifying gene networks 
that contribute to multigenic pheno- 
types and complex disease process- 
es. The integration of mouse and 
human genetic studies forms the 
basis of the technology. The Genome 
Tagged Mice database in develop- 
ment will serve as a library of natur- 
al mouse genetic and phenotypic 
variation. Disease-related genes 
identified in mice are then evaluated 
in human family- and population-based
studies to confirm their clinical
relevance and linkages to
pathophysiologic traits.



Blocking Gene Expression 

Inactivating a gene known to be 
expressed in association with a par- 
ticular disease is one approach to 
identifying appropriate therapeutic 
targets. The target validation and
discovery program at Ribozyme
Pharmaceuticals, Inc. (Boulder,
CO) applies the company's ribozyme
technology to achieve selective
inhibition of gene expression in cell
culture and in animals.

Correlation of the gene expression
inhibition with phenotype can
suggest the relative importance of
the gene in disease pathology. The
company's nuclease-resistant
ribozymes form the basis of a
collaboration with Schering AG
(Germany) for drug target validation
and the development of
ribozyme-based therapeutic agents, and with
Chiron Corp. (Emeryville, CA) for
target validation.

With several antisense compounds
now progressing through clinical
trials, the concept of using
oligonucleotides to inhibit gene activity is
not new. But rather than focusing on
therapeutics development, Sequitur,
Inc. (Natick, MA) is creating
antisense compounds for the purpose of
determining gene function and
validating drug targets. Clients typically
provide the one-year-old company
with the sequence (or EST) of a
potential gene target and, in return,
Sequitur custom designs a series of
three to six antisense compounds that
yield a three-to-ten-fold inhibition of
the target gene in cell culture. The
company also provides oligofectins,
a series of cationic lipids, to deliver
the oligonucleotides to a variety of
cultured cells.

"Differential expression
information is just for correlation, it doesn't
tell function or confirm what would
be a good target," says Tod Woolf,
Ph.D., director of technology
development at Sequitur. Antisense
compounds, by contrast, will inhibit a
target. Sequitur offers both
phosphorothioate DNA antisense
compounds and its proprietary Next
Generation chimeric
oligonucleotides, which have a higher
hybridization affinity, greater
specificity and reduced toxicity, according
to the company.

Mining Pathogen Genomes 

Companies such as Human 
Genome Sciences (HGS; Rockville,
MD), Incyte (Palo Alto, CA),




AxCell Biosciences scientists say their technology enables the rapid and 
simple functional identification of the two essential molecular components 
of protein interaction networks: specific recognition units that bind distinct 
modular protein domains are identified and isolated using a combination 
structural/functional approach that uses both peptide phage display Genetic
Diversity Libraries (GDL) and bioinformatics, and Cloning of Ligand
Targets (COLT) technology utilizes recognition units as functional probes to
isolate families of interactor proteins. 



Millennium Pharmaceuticals Inc. 
(Cambridge, MA) and Genome 
Therapeutics (Waltham, MA) are 
relying on high-speed DNA sequenc- 
ing, positional cloning and other 
strategies to identify specific micro- 
bial genomic sites that would be 
good targets for infectious disease 
therapeutics. 

HGS recently completed sequenc- 
ing of the bacterial pathogen 
Streptococcus pneumoniae, which is 
the focus of an agreement with 
Hoffmann-La Roche (Basel, 
Switzerland). Roche will use the 
sequence data to develop new anti- 
infectives against S. pneumoniae. 
HGS and Roche have expanded their 
collaboration to include a nonexclu- 
sive license to access sequence infor- 
mation for the intestinal bacterium 
Enterococcus faecalis. 

Incyte Pharmaceuticals has
completed one-fold coverage of the
Candida albicans genome,
identifying 60% of the genes of this fungal
pathogen. This genome will become
part of the company's PathoSeq
microbial database. Incyte recently
introduced the ZooSeq animal gene
sequence and expression database.
The database will provide genomic
information across various species
commonly used in preclinical drug
testing, which may help to better
define potential drug targets.

Millennium Pharmaceuticals
continues to report success in identifying
novel drug targets, having recently
discovered a novel chemokine called
neurotactin and a new class of
MAD-related proteins that inhibit
transforming growth factor beta (TGF-β)
signaling. The company also
received U.S. patent coverage for the
tub genes, believed to play a role in
obesity, and for the gene that encodes
the protein melastatin, which appears
to suppress metastasis in malignant
melanoma.




HIGH SPECIFIC ACTIVITY 
MICROBIAL ALKALINE 
PHOSPHATASE 
from Biocatalysts 

Biocatalysts Limited, the British speciality enzyme 
company, has developed a completely new type of 
alkaline phosphatase with many advantages over the 
types most commonly used. 
It is of microbial origin with a high specific activity
(unlike that from E. coli) and with higher temperature and
storage stability compared to that from calf intestine.
This is the first of several new generation diagnostic 
enzymes being developed by Biocatalysts Limited with 
greatly improved stability. 

• Non-animal source, no risk of BSE or animal 
virus contamination 

• Higher temperature stability than calf intestine

• Much higher specific activity than from E. coli

• Very high storage stability even in the absence 
of glycerol 

For further details on alkaline phosphatase and our other 
diagnostic enzymes contact us direct at the address below or 
within North America contact our US Distributor KaltrorhPettibone 
phone: 630-350-1116 or fax: 630-350-1606

Biocatalysts Limited 

Treforest Industrial Estate Pontypridd Wales UK CF37 SUD
Tel: +44 (0)1443 843712 Fax: +44 (0)1443 041214
a-mail-KeUy@BiocatalystsxoiD. 
Pangea



Smith, now a computer programmer,
is an expert in systems integration,
Internet technologies and the
application of industrial engineering
principles to the drug discovery
process. Before co-founding Pangea,
he was the manager of software
development at Attorneys Briefcase,
a legal research software company.

By being "in the trenches" with
customers and collaborators,
Bellenson and Smith sensed the
frustration of pharmaceutical
researchers whose incompatible
tools have impeded their progress.
According to Bellenson, "Most of
them are geared toward analyzing
one molecule at a time. It's like
emptying the ocean with an eye
dropper, an incompatible eye dropper at
that. A pharmaceutical company
may have 30 different drug
discovery teams with various approaches.
The problem is to manage the
process of experimenting with a lot
of different approaches, to automate
while maintaining flexibility."

GeneWorld 2.1 enables
"integration of the entire target discovery and
validation process," Bellenson says.
The commercial software package
coordinates the entire process of
sequence-data analysis and can be
integrated with other programs and
databases, according to Smith, who
adds that it handles thousands of
sequence results, organizes and
automates annotation and seamlessly
interacts with growing genome
databases. Simple forms and menus
enable users to turn raw sequence
data into crucial knowledge for drug
discovery by applying algorithms to
sequences, creating custom analysis
strategies and producing useful
reports, without the need for writing
computer code. GeneWorld 2.1 runs
on a variety of platforms and
operating systems.

Pairing industrial relational
database-management systems with a
web-browser interface, Pangea's
Operating System of Drug
Discovery™ is an open-computing
framework that allows client/server
and Java-enabled web-based
technologies to collect, organize and
analyze drug discovery information for
pharmaceutical companies to
simplify and accelerate drug discovery. The
technology unites automated
genomics database analysis for drug
target site selection, chemical
information database analysis and
large-scale combinatorial chemistry
project management and
high-throughput screening project management
for drug lead efficacy analysis.
Pangea officials maintain that these
integrated elements provide a unified
environment for chemists, biologists
and others involved in the drug
discovery process to work together with
commercial and public domain
software.

Bioinformaticists can design and
save Strategies, such as the one
shown here, that forward data
through multiple-step analyses
logically and automatically.

organization can apply the same
Strategies to their own data.

Pangea's Operating System of 
Drug Discovery can accommodate 
Sybase, Oracle or Informix relation- 
al database-management systems 
and any version of UNIX. It absorbs 
new data formats, databases, algo- 
rithms and analysis paradigms into 
the automated workflow without 
software modifications. Netscape 
Navigator™ provides a friendly user 
interface from PC, Macintosh, and 
UNIX workstations. 

In the near term, Pangea plans to 
complete its bioinformatics core 
with two more programs. Gene 
Foundry, a sample tracking and 
workflow sequence package for 
DNA sequence and fragment infor- 
mation, will also offer interaction 
with robots, reagent tracking and 
troubleshooting. Gene Thesaurus,
the other package, is a "warehouse
of bioinformatics data," says 
Bellenson. ■ 



Europe 



GTAC Chairman, Professor 
Norman C. Nevin, said 1996 saw 
"four important developments": an 
increase in enquiries and submis- 
sions made to GTAC; an increase in 
the complexity of submitted proto- 
cols; a continuing shift from gene 
therapy for single-gene disorders 
toward strategies aimed at tumour 
destruction in cancer; and a growth 
in international sponsorship of U.K. 
gene therapy trials. 

Since 1993, GTAC and its
predecessor, the Clothier Committee, have
approved 18 U.K. gene therapy clini- 
cal trials (13 of which have been car- 
ried out), which are listed in the 
report. The disease areas targeted by 
these trials include severe combined 
immunodeficiency (1 trial), cystic 
fibrosis (6), metastatic melanoma (2), 
lymphoma (2), neuroblastoma (1),
breast cancer (1), Hurler's syndrome
(1), cervical cancer (1), glioblastoma



breast cancer, breast cancer with liver 
metastases, glioblastoma, malignant 
ascites due to gastrointestinal cancer 
and ovarian cancer. 

Copies of the GTAC third annual
report are available from the GTAC
Secretariat, Wellington House,
133-155 Waterloo Road, London SE1
8UG, UK.

Coated Lenses Prevent PCO 

Scientists in the U.K. say it may be 
possible to prevent posterior capsule 
opacification (PCO), a common 
complication following cataract 
surgery, by using the implanted poly- 
methylmethacrylate (PMMA) 
intraocular lens as a drug delivery 
system. PCO occurs in 30-50% of 
cataract surgery patients as a result of 
stimulated cell growth within the 
remaining capsular bag. The condi- 
tion causes a decline in visual acuity 
and requires expensive laser treat- 
ment, thus negating the routine use of 
cataract surgery in underdeveloped 
countries, explains G. Duncan, at the 
Exploring the Metabolic and Genetic Control of
Gene Expression on a Genomic Scale

Joseph L. DeRisi, Vishwanath R. Iyer, Patrick O. Brown*

DNA microarrays containing virtually every gene of Saccharomyces cerevisiae were used 
to carry out a comprehensive investigation of the temporal program of gene expression
accompanying the metabolic shift from fermentation to respiration. The expression 
profiles observed for genes with known metabolic functions pointed to features of the 
metabolic reprogramming that occur during the diauxic shift, and the expression patterns 
of many previously uncharacterized genes provided clues to their possible functions. The 
same DNA microarrays were also used to identify genes whose expression was affected 
by deletion of the transcriptional co-repressor TUP1 or overexpression of the transcrip- 
tional activator YAP1 . These results demonstrate the feasibility and utility of this ap- 
proach to genomewide exploration of gene expression patterns. 




The complete sequences of nearly a dozen
microbial genomes are known, and in the 
next several years we expect to know the 
complete genome sequences of several 
metazoans, including the human genome. 
Defining the role of each gene in these 
genomes will be a formidable task, and un- 
derstanding how the genome functions as a 
whole in the complex natural history of a 
living organism presents an even greater 
challenge. 

Knowing when and where a gene is 
expressed often provides a strong clue as to 
its biological role. Conversely, the pattern 
of genes expressed in a cell can provide 
detailed information about its state. Al- 
though regulation of protein abundance in 
a cell is by no means accomplished solely 
by regulation of mRNA, virtually all dif- 
ferences in cell type or state are correlated 
with changes in the mRNA levels of many 
genes. This is fortuitous because the only 
specific reagent required to measure the 
abundance of the mRNA for a specific 
gene is a cDNA sequence. DNA microar- 
rays, consisting of thousands of individual 
gene sequences printed in a high-density 
array on a glass microscope slide (1, 2),
provide a practical and economical tool 
for studying gene expression on a very 
large scale (3-6). 

Saccharomyces cerevisiae is an especially 
favorable organism in which to conduct a
systematic investigation of gene expression. 
The genes are easy to recognize in the ge- 
nome sequence, cis regulatory elements are 
generally compact and close to the tran- 
scription units, much is already known 
about its genetic regulatory mechanisms, 
and a powerful set of tools is available for its 
analysis. 

A recurring cycle in the natural history 
of yeast involves a shift from anaerobic 
(fermentation) to aerobic (respiration) me- 
tabolism. Inoculation of yeast into a medi- 
um rich in sugar is followed by rapid growth 
fueled by fermentation, with the production 
of ethanol. When the fermentable sugar is 
exhausted, the yeast cells turn to ethanol as 
a carbon source for aerobic growth. This 
switch from anaerobic growth to aerobic 
respiration upon depletion of glucose, re- 
ferred to as the diauxic shift, is correlated 
with widespread changes in the expression 
of genes involved in fundamental cellular 
processes such as carbon metabolism, pro- 
tein synthesis, and carbohydrate storage 
(7). We used DNA microarrays to charac- 
terize the changes in gene expression that 
take place during this process for nearly the 
entire genome, and to investigate the ge- 
netic circuitry that regulates and executes 
this program. 

Yeast open reading frames (ORFs) were 
amplified by the polymerase chain reaction 
(PCR), with a commercially available set of
primer pairs (8). DNA microarrays, con- 
taining approximately 6400 distinct DNA 
sequences, were printed onto glass slides by 



using a simple robotic printing device (9).
Cells from an exponentially growing culture
of yeast were inoculated into fresh medium
and grown at 30°C for 21 hours. After an
initial 9 hours of growth, samples were
harvested at seven successive 2-hour intervals,
and mRNA was isolated (10). Fluorescently
labeled cDNA was prepared by reverse
transcription in the presence of Cy3 (green)-
or Cy5 (red)-labeled deoxyuridine
triphosphate (dUTP) (11) and then hybridized to
the microarrays (12). To maximize the
reliability with which changes in expression
levels could be discerned, we labeled cDNA
prepared from cells at each successive time
point with Cy5, then mixed it with a
Cy3-labeled "reference" cDNA sample prepared
from cells harvested at the first interval
after inoculation. In this experimental
design, the relative fluorescence intensity
measured for the Cy3 and Cy5 fluors at
each array element provides a reliable
measure of the relative abundance of the
corresponding mRNA in the two cell
populations (Fig. 1). Data from the series of seven
samples (Fig. 2), consisting of more than
43,000 expression-ratio measurements,
were organized into a database to facilitate
efficient exploration and analysis of the
results. This database is publicly available
on the Internet (13).
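The two-color design described above reduces each array element to a single Cy5/Cy3 ratio. A minimal sketch of that calculation follows, assuming a simple per-channel background subtraction; the function names, the background parameters, and the choice to report a log2 ratio are illustrative assumptions, not details taken from the paper.

```python
import math

def expression_ratio(cy5, cy3, background_cy5=0.0, background_cy3=0.0):
    """Return the Cy5/Cy3 intensity ratio for one array element.

    cy5 is the signal from the time-point sample, cy3 the signal from the
    reference sample. Background subtraction is an assumption of this sketch.
    """
    red = cy5 - background_cy5
    green = cy3 - background_cy3
    if red <= 0 or green <= 0:
        return None  # signal cannot be distinguished from background
    return red / green

def log2_ratio(cy5, cy3, **backgrounds):
    """Log2 of the ratio, so induction and repression are symmetric about 0."""
    ratio = expression_ratio(cy5, cy3, **backgrounds)
    return None if ratio is None else math.log2(ratio)
```

A ratio above 1 (positive log2 value) corresponds to a red spot in the figures, and a ratio below 1 to a green spot.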

During exponential growth in glucose- 
rich medium, the global pattern of gene 
expression was remarkably stable. Indeed, 
when gene expression patterns between the 
first two cell samples (harvested at a 2-hour 
interval) were compared, mRNA levels dif- 
fered by a factor of 2 or more for only 19 
genes (0.3%), and the largest of these
differences was only 2.7-fold (14). However, as
glucose was progressively depleted from the 
growth media during the course of the ex- 
periment, a marked change was seen in the 
global pattern of gene expression. mRNA 
levels for approximately 710 genes were 
induced by a factor of at least 2, and the 
mRNA levels for approximately 1030 genes 
declined by a factor of at least 2. Messenger 
RNA levels for 183 genes increased by a 
factor of at least 4, and mRNA levels for 
203 genes diminished by a factor of at least 
4. About half of these differentially ex- 
pressed genes have no currently recognized 
function and are not yet named. Indeed, 
more than 400 of the differentially ex- 
pressed genes have no apparent homology 
to any gene whose function is known (15).
The responses of these previously
uncharacterized genes to the diauxic shift therefore
provide the first small clue to their possible
roles.
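The gene counts quoted above come from applying a fold-change cutoff (2 or 4) to the measured ratios. The sketch below shows one way such a tally could be made, assuming one ratio per gene has already been chosen; the function names and the dictionary layout are hypothetical.

```python
def classify(ratio, threshold=2.0):
    """Label a ratio as induced, repressed, or unchanged at a fold cutoff."""
    if ratio >= threshold:
        return "induced"
    if ratio <= 1.0 / threshold:
        return "repressed"
    return "unchanged"

def count_changes(ratios, threshold=2.0):
    """Tally the classes for a mapping of gene name -> expression ratio."""
    counts = {"induced": 0, "repressed": 0, "unchanged": 0}
    for ratio in ratios.values():
        counts[classify(ratio, threshold)] += 1
    return counts
```

Raising the threshold from 2 to 4 shrinks both the induced and the repressed sets, which is the pattern the paragraph above reports (710 and 1030 genes at twofold, 183 and 203 at fourfold).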

The global view of changes in expres- 
sion of genes with known functions pro- 
vides a vivid picture of the way in which 
the cell adapts to a changing environ- 
ment. Figure 3 shows a portion of the yeast 
metabolic pathways involved in carbon 
and energy metabolism. Mapping the 
changes we observed in the mRNAs en- 
coding each enzyme onto this framework 
allowed us to infer the redirection in the 
flow of metabolites through this system. 
We observed large inductions of the genes
coding for the enzymes aldehyde
dehydrogenase (ALD2) and acetyl-coenzyme
A (CoA) synthase (ACS1), which
function together to convert the products of
alcohol dehydrogenase into acetyl-CoA,
which in turn is used to fuel the
tricarboxylic acid (TCA) cycle and the glyoxylate
cycle. The concomitant shutdown of
transcription of the genes encoding pyruvate
decarboxylase and induction of pyruvate
carboxylase rechannels pyruvate away
from acetaldehyde, and instead to
oxalacetate, where it can serve to supply the
TCA cycle and gluconeogenesis.
Induction of the pivotal genes PCK1, encoding
phosphoenolpyruvate carboxykinase, and
FBP1, encoding fructose
1,6-biphosphatase, switches the directions of two key
irreversible steps in glycolysis, reversing
the flow of metabolites along the
reversible steps of the glycolytic pathway toward
the essential biosynthetic precursor,
glucose-6-phosphate. Induction of the genes
coding for the trehalose synthase and
glycogen synthase complexes promotes
channeling of glucose-6-phosphate into these
carbohydrate storage pathways.

Just as the changes in expression of
genes encoding pivotal enzymes can
provide insight into metabolic
reprogramming, the behavior of large groups of
functionally related genes can provide a broad
view of the systematic way in which the
yeast cell adapts to a changing
environment (Fig. 4). Several classes of genes,
such as cytochrome c-related genes and
those involved in the TCA/glyoxylate
cycle and carbohydrate storage, were
coordinately induced by glucose exhaustion. In
contrast, genes devoted to protein
synthesis, including ribosomal proteins, tRNA
synthetases, and translation, elongation,
and initiation factors, exhibited a
coordinated decrease in expression. More than
95% of ribosomal genes showed at least
twofold decreases in expression during the
diauxic shift (Fig. 4) (13). A noteworthy
and illuminating exception was that the
genes encoding mitochondrial ribosomal
genes were generally induced rather than
repressed after glucose limitation,
highlighting the requirement for mitochondrial
biogenesis (13). As more is learned about
the functions of every gene in the yeast
genome, the ability to gain insight into a
cell's response to a changing environment
through its global gene expression patterns
will become increasingly powerful.

Several distinct temporal patterns of ex- 
pression could be recognized, and sets of 
genes could be grouped on the basis of the 
similarities in their expression patterns. The 
characterized members of each of these 
groups also shared important similarities in 
their functions. Moreover, in most cases, 
common regulatory mechanisms could be 
inferred for sets of genes with similar expres- 
sion profiles. For example, seven genes 
showed a late induction profile, with mRNA 
levels increasing by more than ninefold at 



the last timepoint but less than threefold at
the preceding timepoint (Fig. 5B). All of
these genes were known to be
glucose-repressed, and five of the seven were previously
noted to share a common upstream
activating sequence (UAS), the carbon source
response element (CSRE) (16-20). A search
in the promoter regions of the remaining two
genes, ACR1 and IDP2, revealed that
ACR1, a gene essential for ACS1 activity,
also possessed a consensus CSRE motif, but
interestingly, IDP2 did not. A search of the
entire yeast genome sequence for the
consensus CSRE motif revealed only four
additional candidate genes, none of which
showed a similar induction.
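The late induction profile described above is defined by two cutoffs: more than ninefold at the last timepoint and less than threefold at the one before it. A minimal sketch of that filter follows, assuming each gene's profile is a list of expression ratios in time order; the function name and data layout are hypothetical.

```python
def is_late_induced(profile, last_min=9.0, previous_max=3.0):
    """True if a time-ordered list of ratios matches the late-induction rule.

    The defaults mirror the cutoffs quoted in the text: more than ninefold at
    the last timepoint, less than threefold at the preceding one.
    """
    if len(profile) < 2:
        return False
    return profile[-1] > last_min and profile[-2] < previous_max
```

Applied to every gene's seven-point series, a filter of this kind would return the candidate set whose promoters are then inspected for a shared motif.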

Examples from additional groups of 
genes that shared expression profiles are 
illustrated in Fig. 5, C through F. The 
sequences upstream of the named genes in 
Fig. 5C all contain stress response ele- 
ments (STRE), and with the exception 
Fig. 1. Yeast genome microarray. The actual size of the microarray is 18 mm by 18 mm. The
microarray was printed as described (9). This image was obtained with the same fluorescent
scanning confocal microscope used to collect all the data we report (49). A fluorescently labeled
cDNA probe was prepared from mRNA isolated from cells harvested shortly after inoculation (culture
density of <5 x 10^6 cells/ml and media glucose level of 19 g/liter) by reverse transcription in the
presence of Cy3-dUTP. Similarly, a second probe was prepared from mRNA isolated from cells taken
from the same culture 9.5 hours later (culture density of ~2 x 10^8 cells/ml, with a glucose level of
<0.2 g/liter) by reverse transcription in the presence of Cy5-dUTP. In this image, hybridization of the
Cy3-dUTP-labeled cDNA (that is, mRNA expression at the initial timepoint) is represented as a green
signal, and hybridization of Cy5-dUTP-labeled cDNA (that is, mRNA expression at 9.5 hours) is
represented as a red signal. Thus, genes induced or repressed after the diauxic shift appear in this
image as red and green spots, respectively. Genes expressed at roughly equal levels before and after
the diauxic shift appear in this image as yellow spots.
of HSP42, have previously been shown to
be controlled at least in part by these
elements (21-24). Inspection of the
sequences upstream of HSP42 and the two
uncharacterized genes shown in Fig. 5C,
YKL026c, a hypothetical protein with
similarity to glutathione peroxidase, and
YGR043c, a putative transaldolase,
revealed that each of these genes also
possesses repeated upstream copies of the
stress-responsive CCCCT motif. Of the 13
additional genes in the yeast genome that
shared this expression profile [including
HSP30, ALD2, OM45, and 10
uncharacterized ORFs (25)], nine contained one or
more recognizable STRE sites in their
upstream regions.

The heterotrimeric transcriptional acti- 
vator complex HAP2,3,4 has been shown 
to be responsible for induction of several 
genes important for respiration (26-28). 
This complex binds a degenerate consensus 
sequence known as the CCAAT box (26). 
Computer analysis, using the consensus se- 
quence TNRYTGGB (29), has suggested 
that a large number of genes involved in 
respiration may be specific targets of 
HAP2,3,4 (30). Indeed, a putative 
HAP2,3,4 binding site could be found in 
the sequences upstream of each of the seven 
cytochrome c-related genes that showed 
the greatest magnitude of induction (Fig. 
5D). Of 12 additional cytochrome c-related 
genes that were induced, HAP2,3,4 binding 
sites were present in all but one. Signifi- 
cantly, we found that transcription of 
HAP4 itself was induced nearly ninefold 
concomitant with the diauxic shift. 
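The consensus TNRYTGGB cited above is written in the standard IUPAC degenerate-nucleotide code (N = any base, R = A or G, Y = C or T, B = C, G, or T). The sketch below shows one way such a consensus could be scanned for in an upstream sequence by translating it into a regular expression. It is an illustration only, not the authors' program: the function names are hypothetical, and it scans one strand, so a full search would also scan the reverse complement.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Translate a degenerate consensus such as TNRYTGGB into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())

def find_sites(sequence, motif):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [match.start() for match in pattern.finditer(sequence.upper())]
```

For example, `find_sites("GGTCATTGGTAA", "TNRYTGGB")` reports a site at position 2 (TCATTGGT). The core ATTGG in that match is the reverse complement of CCAAT, which is why this consensus describes a CCAAT box read from the opposite strand.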

Control of ribosomal protein biogenesis
is mainly exerted at the transcriptional
level, through the presence of a common
upstream-activating element (UAS)
that is recognized by the Rap1
DNA-binding protein (31, 32). The expression
profiles of seven ribosomal proteins are shown
in Fig. 5F. A search of the sequences
upstream of all seven genes revealed
consensus Rap1-binding motifs (33). It has
been suggested that declining Rap1 levels
in the cell during starvation may be
responsible for the decline in ribosomal
protein gene expression (34). Indeed, we
observed that the abundance of RAP1
mRNA diminished by 4.4-fold, at about
the time of glucose exhaustion.

Of the 149 genes that encode known or
putative transcription factors, only two,
HAP4 and SIP4, were induced by a factor of
more than threefold at the diauxic shift.
SIP4 encodes a DNA-binding
transcriptional activator that has been shown to
interact with Snf1, the "master regulator" of
glucose repression (35). The eightfold
induction of SIP4 upon depletion of glucose
strongly suggests a role in the induction of
downstream genes at the diauxic shift.

Although most of the transcriptional 
responses that we observed were not pre- 
viously known, the responses of many 
genes during the diauxic shift have been 
described. Comparison of the results we 
obtained by DNA microarray hybridiza- 
tion with previously reported results there- 
fore provided a strong test of the sensitiv- 
ity and accuracy of this approach. The 
expression patterns we observed for previ- 
ously characterized genes showed almost 
perfect concordance with previously pub- 
lished results (36). Moreover, the differ- 
ential expression measurements obtained 
by DNA microarray hybridization were re- 
producible in duplicate experiments. For 
example, the remarkable changes in gene 
expression between cells harvested imme- 
diately after inoculation and immediately 
after the diauxic shift (the first and sixth 
intervals in this time series) were mea- 
sured in duplicate, independent DNA mi- 
croarray hybridizations. The correlation 
coefficient for two complete sets of expres- 
sion ratio measurements was 0.87, and for 
more than 95% of the genes, the expres- 



sion ratios measured in these duplicate 
experiments differed by less than a factor 
of 2. However, in a few cases, there were 
discrepancies between our results and pre- 
vious results, pointing to technical limita- 
tions that will need to be addressed as 
DNA microarray technology advances 
(37, 38). Despite the noted exceptions, 
the high concordance between the results 
we obtained in these experiments and 
those of previous studies provides confi- 
dence in the reliability and thoroughness 
of the survey. 
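The reproducibility figures above rest on two summary statistics: a correlation coefficient between duplicate sets of ratios, and the fraction of genes whose duplicate ratios agree within a factor of 2. The sketch below shows both, using only the standard library. Whether the authors computed the correlation on raw or log-transformed ratios is not stated in this text, so the functions take whatever values the caller supplies, and their names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fraction_within_factor(ratios_a, ratios_b, factor=2.0):
    """Fraction of paired ratios that differ by less than the given factor."""
    limit = math.log2(factor)
    close = sum(
        1 for a, b in zip(ratios_a, ratios_b)
        if abs(math.log2(a) - math.log2(b)) < limit
    )
    return close / len(ratios_a)
```

Comparing ratios on a log scale treats a twofold over-estimate and a twofold under-estimate symmetrically, which a plain difference of ratios would not.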

The changes in gene expression during 
this diauxic shift are complex and involve 
integration of many kinds of information 
about the nutritional and metabolic state 
of the cell. The large number of genes 
whose expression is altered and the diver- 
sity of temporal expression profiles ob- 
served in this experiment highlight the 
challenge of understanding the underlying 
regulatory mechanisms. One approach to 
defining the contributions of individual 
regulatory genes to a complex program of 
this kind is to use DNA microarrays to 
identify genes whose expression is affected 



Fig. 2. The section of the ar- 
ray indicated by the gray box 
in Fig. 1 is shown for each of 
the experiments described 
here. Representative genes 
are labeled. In each of the ar- 
rays used to analyze gene 
expression during the diauxic 
shift, red spots represent 
genes that were induced rel- 
ative to the initial timepoint, 
and green spots represent 
genes that were repressed 
relative to the initial timepoint. 
In the arrays used to analyze 
the effects of the tup1Δ
mutation and YAP1 overexpression,
red spots represent
genes whose expression was 
increased, and green spots 
represent genes whose ex- 
pression was decreased by 
the genetic modification. Note 
that distinct sets of genes are 
induced and repressed in the 
different experiments. The 
complete images of each of 
these arrays can be viewed on 
the Internet (13). Cell density 
as measured by optical densi- 
ty (OD) at 600 nm was used to 
measure the growth of the 
culture. 
by mutations in each putative regulatory
gene. As a test of this strategy, we analyzed
the genomewide changes in gene expression
that result from deletion of the TUP1 gene.
Transcriptional repression of many genes by
glucose requires the DNA-binding repressor
Mig1 and is mediated by recruiting the
transcriptional co-repressors Tup1 and
Cyc8/Ssn6 (39). Tup1 has also been implicated in
repression of oxygen-regulated,
mating-type-specific, and DNA-damage-inducible genes
(40).
Fig. 3. Metabolic reprogramming inferred from global analysis of changes in gene expression. Only key
metabolic intermediates are identified. The yeast genes encoding the enzymes that catalyze each step 
in this metabolic circuit are identified by name in the boxes. The genes encoding succinyl-CoA synthase 
and glycogen-debranching enzyme have not been explicitly identified, but the ORFs YGR244 and 
YPR184 show significant homology to known succinyl-CoA synthase and glycogen-debranching en- 
zymes, respectively, and are therefore included in the corresponding steps in this figure. Red boxes with 
white lettering identify genes whose expression increases in the diauxic shift. Green boxes with dark 
green lettering identify genes whose expression diminishes in the diauxic shift. The magnitude of 
induction or repression is indicated for these genes. For multimeric enzyme complexes, such as 
succinate dehydrogenase, the indicated fold-induction represents an unweighted average of all the 
genes listed in the box. Black and white boxes indicate no significant differential expression (less than
twofold). The direction of the arrows connecting reversible enzymatic steps indicate the direction of the 
flow of metabolic intermediates, inferred from the gene expression pattern, after the diauxic shift. Arrows 
representing steps catalyzed by genes whose expression was strongly induced are highlighted in red. 
The broad gray arrows represent major increases in the flow of metabolites after the diauxic shift, 
inferred from the indicated changes in gene expression. 



Wild-type yeast cells and cells bearing
a deletion of the TUP1 gene (tup1Δ) were
grown in parallel cultures in rich medium
containing glucose as the carbon source.
Messenger RNA was isolated from exponentially
growing cells from the two populations
and used to prepare cDNA labeled
with Cy3 (green) and Cy5 (red),
respectively (11). The labeled probes were
mixed and simultaneously hybridized to
the microarray. Red spots on the microarray
therefore represented genes whose
transcription was induced in the tup1Δ
strain, and thus presumably repressed by
Tup1 (41). A representative section of the
microarray (Fig. 2, bottom middle panel)
illustrates that the genes whose expression
was affected by the tup1Δ mutation were,
in general, distinct from those induced
upon glucose exhaustion [complete images
of all the arrays shown in Fig. 2 are available
on the Internet (13)]. Nevertheless,
34 (10%) of the genes that were induced
by a factor of at least 2 after the diauxic
shift were similarly induced by deletion of
TUP1, suggesting that these genes may be
subject to TUP1-mediated repression by
glucose. For example, SUC2, the gene encoding
invertase, and all five hexose transporter
genes that were induced during the
course of the diauxic shift were similarly
induced, in duplicate experiments, by the
deletion of TUP1.
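
The overlap described above (genes induced at least twofold after the diauxic shift that are also induced by the TUP1 deletion) amounts to a set intersection over thresholded ratios. The following is a minimal sketch, assuming each experiment is available as a mapping from gene name to red/green ratio; the function name `induced` and all ratio values are hypothetical illustrations, not data or code from the study.

```python
def induced(replicates, threshold=2.0):
    """Return the genes whose ratio meets the threshold in every replicate.

    `replicates` is a list of dicts mapping gene name -> red/green ratio.
    """
    gene_sets = [
        {gene for gene, ratio in ratios.items() if ratio >= threshold}
        for ratios in replicates
    ]
    return set.intersection(*gene_sets) if gene_sets else set()


# Hypothetical ratios, chosen only to exercise the logic.
diauxic_shift = [{"SUC2": 5.1, "HXT2": 3.0, "RPL3": 0.2, "CIT2": 4.0}]
tup1_deletion = [
    {"SUC2": 6.0, "HXT2": 2.4, "RPL3": 1.0, "CIT2": 1.1},
    {"SUC2": 4.8, "HXT2": 2.1, "RPL3": 0.9, "CIT2": 1.3},
]

shift_induced = induced(diauxic_shift)
shared = shift_induced & induced(tup1_deletion)
fraction_shared = len(shared) / len(shift_induced)
```

Requiring the threshold in every replicate mirrors the text's emphasis on duplicate experiments; a different replicate rule would change the gene list.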

The set of genes affected by Tup1 in this
experiment also included α-glucosidases,
the mating-type-specific genes MFA1 and
MFA2, and the DNA damage-inducible
RNR2 and RNR4, as well as genes involved
in flocculation and many genes of unknown
function. The hybridization signal corresponding
to expression of TUP1 itself was
also severely reduced because of the (incomplete)
deletion of the transcription unit
in the tup1Δ strain, providing a positive
control in the experiment (42).

Many of the transcriptional targets of
Tup1 fell into sets of genes with related
biochemical functions. For instance, although
only about 3% of all yeast genes
appeared to be TUP1-repressed by a factor
of more than 2 in duplicate experiments
under these conditions, 6 of the 13 genes
that have been implicated in flocculation
(15) showed a reproducible increase in
expression of at least twofold when TUP1
was deleted. Another group of related
genes that appeared to be subject to TUP1
repression encodes the serine-rich cell
wall mannoproteins, such as Tip1 and
Tir1/Srp1, which are induced by cold
shock and other stresses (43), and similar,
serine-poor proteins, the seripauperins
(44). Messenger RNA levels for 23 of the
26 genes in this group were reproducibly
elevated by at least 2.5-fold in the tup1Δ
strain, and 18 of these genes were induced
by more than sevenfold when TUP1 was
deleted. In contrast, none of 83 genes that
could be classified as putative regulators of
the cell division cycle were induced more
than twofold by deletion of TUP1. Thus,
despite the diversity of the regulatory systems
that employ Tup1, most of the genes
that it regulates under these conditions
fall into a limited number of distinct functional
classes.
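
One way to gauge how unusual 6 of 13 flocculation genes is against a background rate of about 3% is a binomial tail probability. The sketch below is not a calculation from the source; it assumes, as a simplification, that genes respond independently at the background rate, and the function name `binomial_tail` is illustrative.

```python
from math import comb


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


background_rate = 0.03  # "about 3% of all yeast genes" in the text

# Chance of seeing at least 6 of 13 flocculation genes induced by luck alone.
p_flocculation = binomial_tail(6, 13, background_rate)

# Chance of seeing none of 83 cell-cycle regulators induced by luck alone.
p_no_cell_cycle_hits = (1 - background_rate) ** 83
```

Under this simplified model the flocculation result is very unlikely to arise by chance, whereas finding no hits among 83 genes is not surprising by itself at a 3% background rate.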

Because the microarray allows us to
monitor expression of nearly every gene in
yeast, we can, in principle, use this approach
to identify all the transcriptional
targets of a regulatory protein like Tup1. It
is important to note, however, that in any
single experiment of this kind we can only
recognize those target genes that are normally
repressed (or induced) under the
conditions of the experiment. For instance,
the experiment described here analyzed
a MATα strain in which MFA1
and MFA2, the genes encoding the a-factor
mating pheromone precursor, are
normally repressed. In the isogenic tup1Δ
strain, these genes were inappropriately
expressed, reflecting the role that Tup1
plays in their repression. Had we instead
carried out this experiment with a MATa
strain (in which expression of MFA1 and
MFA2 is not repressed), it would not have
been possible to conclude anything regarding
the role of Tup1 in the repression
of these genes. Conversely, we cannot distinguish
indirect effects of the chronic
absence of Tup1 in the mutant strain from
effects directly attributable to its participation
in repressing the transcription of a
gene.

Another simple route to modulating the
activity of a regulatory factor is to overexpress
the gene that encodes it. YAP1 encodes
a DNA-binding transcription factor
belonging to the b-zip class of DNA-binding
proteins. Overexpression of YAP1 in
yeast confers increased resistance to hydrogen
peroxide, o-phenanthroline, heavy
metals, and osmotic stress (45). We analyzed
differential gene expression between a
wild-type strain bearing a control plasmid
and a strain with a plasmid expressing YAP1
under the control of the strong GAL1-10
promoter, both grown in galactose (that is,
a condition that induces YAP1 overexpression).
Complementary DNA from the control
and YAP1 overexpressing strains, labeled
with Cy3 and Cy5, respectively, was
prepared from mRNA isolated from the two
strains and hybridized to the microarray.
Thus, red spots on the array represent genes
that were induced in the strain overexpressing
YAP1.
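
The selection rule stated in the legend of Table 1 (more than twofold induction in both duplicate experiments, and an average above threefold) can be sketched as a simple filter over Cy5/Cy3 ratios. This is an illustrative sketch only; the function name `select_induced`, the placeholder ORF labels, and the ratio values are hypothetical.

```python
def select_induced(rep1, rep2, min_each=2.0, min_mean=3.0):
    """Return {gene: mean ratio} for genes passing both criteria.

    rep1 and rep2 map gene name -> Cy5/Cy3 ratio in each duplicate.
    Results are ordered by decreasing mean ratio.
    """
    selected = {}
    for gene in rep1.keys() & rep2.keys():
        r1, r2 = rep1[gene], rep2[gene]
        mean = (r1 + r2) / 2
        if r1 > min_each and r2 > min_each and mean > min_mean:
            selected[gene] = mean
    return dict(sorted(selected.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical duplicate measurements.
rep1 = {"ORF_A": 12.0, "ORF_B": 2.5, "ORF_C": 6.0, "ORF_D": 1.8}
rep2 = {"ORF_A": 14.0, "ORF_B": 2.9, "ORF_C": 1.9, "ORF_D": 9.0}

hits = select_induced(rep1, rep2)
```

Here only ORF_A passes: ORF_B is reproducible but averages below threefold, while ORF_C and ORF_D each fail the twofold test in one duplicate.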

Of the 17 genes whose mRNA levels
increased by more than threefold when
YAP1 was overexpressed in this way, five
bear homology to aryl-alcohol oxidoreductases
(Fig. 2 and Table 1). An additional
four of the genes in this set also belong to
the general class of dehydrogenases/oxidoreductases.
Very little is known about
the role of aryl-alcohol oxidoreductases in
S. cerevisiae, but these enzymes have been
isolated from ligninolytic fungi, in which
they participate in coupled redox reactions,
oxidizing aromatic and aliphatic
unsaturated alcohols to aldehydes with the
production of hydrogen peroxide (46, 47).
The fact that a remarkable fraction of the
targets identified in this experiment belong
to the same small, functional group of
oxidoreductases suggests that these genes
might play an important protective role
during oxidative stress. Transcription of a
small number of genes was reduced in the
strain overexpressing Yap1. Interestingly,
many of these genes encode sugar permeases
or enzymes involved in inositol
metabolism.

We searched for Yap1-binding sites
(TTACTAA or TGACTAA) in the sequences
upstream of the target genes we
identified (48). About two-thirds of the
genes that were induced by more than
threefold upon Yap1 overexpression had
one or more binding sites within 600 bases
upstream of the start codon (Table 1), suggesting
that they are directly regulated by
Yap1. The absence of canonical Yap1-binding
sites upstream of the others may reflect
an ability of Yap1 to bind sites that differ
from the canonical binding sites, perhaps in
cooperation with other factors, or, less likely,
may represent an indirect effect of Yap1
overexpression, mediated by one or more
intermediary factors. Yap1 sites were found
only four times in the corresponding region
of an arbitrary set of 30 genes that were not
differentially regulated by Yap1.

Fig. 4. Coordinated regulation of functionally related genes. The curves
represent the average induction or repression ratios for all the genes in
each indicated group. The total number of genes in each group was
as follows: ribosomal proteins, 112; translation elongation and initiation
factors, 25; tRNA synthetases (excluding mitochondrial synthetases), 17; glycogen and trehalose
synthesis and degradation, 15; cytochrome c oxidase and reductase proteins, 19; and TCA- and
glyoxylate-cycle enzymes, 24.

Table 1. Genes induced by YAP1 overexpression. This list includes all the genes for which mRNA levels
increased by more than twofold upon YAP1 overexpression in both of two duplicate experiments, and
for which the average increase in mRNA level in the two experiments was greater than threefold (50).
Positions of the canonical Yap1 binding sites upstream of the start codon, when present, and the
average fold-increase in mRNA levels measured in the two experiments are indicated.

ORF       Distance of Yap1    Gene   Description                                                      Fold-
          site from ATG                                                                               increase
YNL331C   -                   -      Putative aryl-alcohol reductase                                  12.9
YKL071W   162-222 (5 sites)   -      Similarity to bacterial csgA protein                             10.4
YML007W   -                   YAP1   Transcriptional activator involved in oxidative stress response  9.8
YFL056C   223, 242            -      Homology to aryl-alcohol dehydrogenases                          9.0
YLL060C   98                  -      Putative glutathione transferase                                 7.4
YOL165C   266                 -      Putative aryl-alcohol dehydrogenase (NADP+)                      7.0
YCR107W   -                   -      Putative aryl-alcohol reductase                                  6.5
YML116W   409                 ATR1   Aminotriazole and 4-nitroquinoline resistance protein            6.5
YBR008C   142, 167, 364       -      Homology to benomyl/methotrexate resistance protein              6.1
YCLX08C   -                   -      Hypothetical protein                                             6.1
YJR155W   -                   -      Putative aryl-alcohol dehydrogenase                              6.0
YPL171C   148, 212            OYE3   NADPH dehydrogenase (old yellow enzyme), isoform 3               5.8
YLR460C   167, 317            -      Homology to hypothetical proteins YCR102c and YNL134C            4.7
YKR076W   178                 -      Homology to hypothetical protein YMR251w                         4.5
YHR179W   327                 OYE2   NAD(P)H oxidoreductase (old yellow enzyme), isoform 1            4.1
YML131W   507                 -      Similarity to A. thaliana zeta-crystallin homolog                3.7
YOL126C   -                   MDH2   Malate dehydrogenase                                             3.3
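
The binding-site search described above can be sketched as a scan of the 600 bases upstream of the start codon for TTACTAA or TGACTAA. The sketch below is an illustration, not the authors' procedure: scanning both strands and measuring distance from the 5' edge of the site (on the coding strand) to the ATG are assumptions, and the function names are hypothetical.

```python
import re

# Lookahead so that overlapping occurrences are also reported.
YAP1_SITE = re.compile(r"(?=(T[TG]ACTAA))")
SITE_LENGTH = 7
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def yap1_site_distances(upstream, window=600):
    """Return sorted distances (in bases) of candidate sites from the ATG.

    `upstream` is the coding-strand sequence, 5' to 3', ending immediately
    before the start codon. Only the last `window` bases are scanned.
    """
    region = upstream.upper()[-window:]
    n = len(region)
    distances = set()
    for match in YAP1_SITE.finditer(region):
        distances.add(n - match.start())
    for match in YAP1_SITE.finditer(reverse_complement(region)):
        # A site found at index i on the opposite strand begins at
        # index n - i - SITE_LENGTH on the coding strand.
        distances.add(match.start() + SITE_LENGTH)
    return sorted(distances)


forward_hit = yap1_site_distances("GG" + "TTACTAA" + "CCCC")
reverse_hit = yap1_site_distances("TTAGTCA" + "AAA")
outside_window = yap1_site_distances("TTACTAA" + "C" * 600)
```

Applying the same function to a set of genes that were not differentially regulated gives the kind of background count the text reports for the 30 control genes.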

Use of a DNA microarray to characterize
the transcriptional consequences of
mutations affecting the activity of regulatory
molecules provides a simple and powerful
approach to dissection and characterization
of regulatory pathways and networks.
This strategy also has an important
practical application in drug screening.
Mutations in specific genes encoding can- 
didate drug targets can serve as surrogates 
for the ideal chemical inhibitor or modu- 
lator of their activity. DNA microarrays 
can be used to define the resulting signa- 
ture pattern of alterations in gene expres- 
sion, and then subsequently used in an 
assay to screen for compounds that repro- 
duce the desired signature pattern. 
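
A screen of this kind needs some measure of how closely a compound's expression pattern matches the mutant's signature. One simple possibility, offered here as an assumption rather than a method from the source, is the Pearson correlation of log2 expression ratios over shared genes. The function names and all ratio values below are hypothetical.

```python
from math import log2, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def signature_similarity(reference, candidate):
    """Correlate log2 ratios over the genes measured in both profiles."""
    genes = sorted(reference.keys() & candidate.keys())
    return pearson(
        [log2(reference[g]) for g in genes],
        [log2(candidate[g]) for g in genes],
    )


# Hypothetical ratios: a mutant signature and two compound profiles.
mutant = {"G1": 4.0, "G2": 0.25, "G3": 1.0, "G4": 8.0}
compound_a = {"G1": 2.0, "G2": 0.5, "G3": 1.0, "G4": 4.0}
compound_b = {"G1": 0.5, "G2": 4.0, "G3": 1.0, "G4": 0.25}

score_a = signature_similarity(mutant, compound_a)
score_b = signature_similarity(mutant, compound_b)
```

Working in log space treats a twofold induction and a twofold repression symmetrically; compound_a tracks the mutant signature while compound_b runs opposite to it.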

DNA microarrays provide a simple and 
economical way to explore gene expres- 
sion patterns on a genomic scale. The 
hurdles to extending this approach to any
other organism are minor. The equipment
required for fabricating and using DNA
microarrays (9) consists of components 
that were chosen for their modest cost and 
simplicity. It was feasible for a small group 
to accomplish the amplification of more 
than 6000 genes in about 4 months and, 
once the amplified gene sequences were in 
hand, only 2 days were required to print a 
set of 110 microarrays of 6400 elements 
each. Probe preparation, hybridization, 
and fluorescent imaging are also simple 
procedures. Even conceptually simple ex- 
periments, as we described here, can yield 
vast amounts of information. The value of 
the information from each experiment of 
this kind will progressively increase as 
more is learned about the functions of 
each gene and as additional experiments 
define the global changes in gene expres- 
sion in diverse other natural processes and 
genetic perturbations. Perhaps the greatest 
challenge now is to develop efficient 
methods for organizing, distributing, inter- 
preting, and extracting insights from the 
large volumes of data these experiments 
will provide.

Fig. 5. Distinct temporal patterns of induction or repression help to group genes that share regulatory
properties. (A) Temporal profile of the cell density, as measured by OD at 600 nm and glucose
concentration in the media. (B) Seven genes exhibited a strong induction (greater than ninefold) only at
the last timepoint (20.5 hours). With the exception of IDP2, each of these genes has a CSRE UAS. There
were no additional genes observed to match this profile. (C) Seven members of a class of genes marked
by early induction with a peak in mRNA levels at 18.5 hours. Each of these genes contains STRE motif
repeats in its upstream promoter region. (D) Cytochrome c oxidase and ubiquinol cytochrome c
reductase genes. Marked by an induction coincident with the diauxic shift, each of these genes contains
a consensus binding motif for the HAP2,3,4 protein complex. At least 17 genes shared a similar
expression profile. (E) SAM1, GPP1, and several genes of unknown function are repressed before the
diauxic shift, and continue to be repressed upon entry into stationary phase. (F) Ribosomal protein
genes comprise a large class of genes that are repressed upon depletion of glucose. Each of the genes

ABSTRACT cDNA microarray technology is used to profile
complex diseases and discover novel disease-related genes. In
inflammatory disease such as rheumatoid arthritis, expression
patterns of diverse cell types contribute to the pathology. We
have monitored gene expression in this disease state with a
microarray of selected human genes of probable significance in
inflammation as well as with genes expressed in peripheral
human blood cells. Messenger RNA from cultured macrophages,
chondrocyte cell lines, primary chondrocytes, and synoviocytes
provided expression profiles for the selected cytokines, chemokines,
DNA binding proteins, and matrix-degrading metalloproteinases.
Comparisons between tissue samples of rheumatoid
arthritis and inflammatory bowel disease verified the involvement
of many genes and revealed novel participation of the
cytokine interleukin 3, chemokine Groα, and the metalloproteinase
matrix metallo-elastase in both diseases. From the
peripheral blood library, tissue inhibitor of metalloproteinase 1,
ferritin light chain, and manganese superoxide dismutase genes
were identified as expressed differentially in rheumatoid arthritis
compared with inflammatory bowel disease. These results
successfully demonstrate the use of the cDNA microarray system
as a general approach for dissecting human diseases.



The recently described cDNA microarray or DNA-chip tech- 
nology allows expression monitoring of hundreds and thou- 
sands of genes simultaneously and provides a format for 
identifying genes as well as changes in their activity (1, 2). 
Using this technology, two-color fluorescence patterns of
differential gene expression in the root versus the shoot tissue
of Arabidopsis were obtained in a specific array of 48 genes (1).
In another study using a 1000 gene array from a human 
peripheral blood library, novel genes expressed by T cells were 
identified upon heat shock and protein kinase C activation (3). 

The technology uses cDNA sequences or cDNA inserts of a 
library for PCR amplification that are arrayed on a glass slide with 
high speed robotics at a density of 1000 cDNA sequences per cm².
These microarrays serve as gene targets for hybridization to 
cDNA probes prepared from RNA samples of cells or tissues. A 
two-color fluorescence labeling technique is used in the prepa- 
ration of the cDNA probes such that a simultaneous hybridization 
but separate detection of signals provides the comparative anal- 
ysis and the relative abundance of specific genes expressed (1,2). 
Microarrays can be constructed from specific cDNA clones of 
interest, a cDNA library, or a select number of open reading 
frames from a genome sequencing database to allow a large-scale 
functional analysis of expressed sequences.

Because of the wide spectrum of genes and endogenous
mediators involved, the microarray technology is well suited 
for analyzing chronic diseases. In rheumatoid arthritis (RA), 
inflammation of the joint is caused by the gene products of 
many different cell types present in the synovium and cartilage 
tissues plus those infiltrating from the circulating blood. The 
autoimmune and inflammatory nature of the disease is a 
cumulative result of genetic susceptibility factors and multiple 
responses, paracrine and autocrine in nature, from macro- 
phages, T cells, plasma cells, neutrophils, synovial fibroblasts, 
chondrocytes, etc. Growth factors, inflammatory cytokines 
(4), and the chemokines (5) are the important mediators of this 
inflammatory process. The ensuing destruction of the cartilage 
and bone by the invading synovial tissue includes the actions 
of prostaglandins and leukotrienes (6), and the matrix degrad- 
ing metalloproteinases (MMPs). The MMPs are an important 
class of Zn-dependent metallo-endoproteinases that can col- 
lectively degrade the proteoglycan and collagen components of 
the connective tissue matrix (7). 

This paper presents a study in which the involvement of
select classes of molecules in RA was examined. Also inves- 
tigated were 1000 human genes randomly selected from a 
peripheral human blood cell library. Their differential and 
quantitative expression analysis in cells of the joint tissue, in 
diseased RA tissue and in inflammatory bowel disease (IBD) 
tissues was conducted to demonstrate the utility of the mi- 
croarray method to analyze complex diseases by their pattern 
of gene expression. Such a survey provides insight not only into 
the underlying cause of the pathology, but also provides the 
opportunity to selectively target genes for disease intervention 
by appropriate drug development and gene therapies. 

METHODS 

Microarray Design, Development, and Preparation. Two ap- 
proaches for the fabrication of cDNA microarrays were used in 
this study. In the first approach, known human genes of probable 
significance in RA were identified. Regions of the clones, pref- 
erably 1 kb in length, were selected by their proximity to the 3' end 
of the cDNA and for areas of least identity to related and 
repetitive sequences. Primers were synthesized to amplify the 
target regions by standard PCR protocols (3). Products were
verified by gel electrophoresis and purified with the Qiaquick 96-well
purification kit (Qiagen, Chatsworth, CA), lyophilized (Savant),
and resuspended in 5 µl of 3× standard saline citrate (SSC) buffer
for arraying. In the second approach, the microarray containing
the 1056 human genes from the peripheral blood lymphocyte
library was prepared as described (3).

Abbreviations: RA, rheumatoid arthritis; MMP, matrix-degrading
metalloproteinase; IBD, inflammatory bowel disease; LPS, lipopolysaccharide;
PMA, phorbol 12-myristate 13-acetate; TNF-α, tumor
necrosis factor α; IL, interleukin; TGF-β, transforming growth factor
β; GCSF, granulocyte colony-stimulating factor; MIP, macrophage
inflammatory protein; MIF, migration inhibitory factor; HME, human
matrix metallo-elastase; RANTES, regulated upon activation, normal
T cell expressed and secreted; Gel, gelatinase; VCAM, vascular cell
adhesion molecule; ICE, IL-1 converting enzyme; PUMP, putative
metalloproteinase; MnSOD, manganese superoxide dismutase; TIMP,
tissue inhibitor of metalloproteinase; MCP, macrophage chemotactic
protein.

Tissue Specimens. Rheumatoid synovial tissue was obtained 
from patients with late stage classic RA undergoing remedial 
synovectomy or arthroplasty of the knee. Synovial tissue was 
separated from any associated connective tissue or fat. One 
gram of each synovial specimen was subjected to RNA extrac- 
tion within 40 min of surgical excision, or explants were 
cultured in serum-free medium to examine any changes under 
in vitro conditions. For IBD, specimens of macroscopically 
inflamed lower intestinal mucosa were obtained from patients 
with Crohn disease undergoing remedial surgery. The hyper- 
trophied mucosal tissue was separated from underlying con- 
nective tissue and extracted for RNA. 

Cultured Cells. The Mono Mac-6 (MM6) monocytic cells 
(8) were grown in RPMI medium. Human chondrosarcoma 
SW1353 cells, primary human chondrocytes, and synoviocytes 
(9, 10) were cultured in DMEM; all culture media were 
supplemented with 10% fetal bovine serum, 100 µg/ml streptomycin,
and 500 units/ml penicillin. Treatment of cells with
lipopolysaccharide (LPS) endotoxin at 30 ng/ml, phorbol
12-myristate 13-acetate (PMA) at 50 ng/ml, tumor necrosis
factor α (TNF-α) at 50 ng/ml, interleukin (IL)-1β at 30 ng/ml,
or transforming growth factor-β (TGF-β) at 100 ng/ml is
described in the figure legends.



Fluorescent Probe, Hybridization, and Scanning. Isolation of 
mRNA, probe preparation, and quantitation with Arabidopsis 
control mRNAs was essentially as described (3) except for the 
following minor modification. Following the reverse transcriptase 
step, the appropriate Cy3- and Cy5-labeled samples were pooled; 
mRNA was degraded by heating the sample to 65°C for 10 min with
the addition of 5 µl of 0.5 M NaOH plus 0.5 ml of 10 mM EDTA.
The pooled cDNA was purified from unincorporated nucleotides
by gel filtration in Centri-spin columns (Princeton Separations,
Adelphia, NJ). Samples were lyophilized and dissolved in 6 µl of
hybridization buffer (5× SSC plus 0.2% SDS). Hybridizations,
washes, scanning, quantitation procedures, and pseudocolor representations
of fluorescent images have been described (3). Scans
for the two fluorescent probes were normalized either to the
fluorescence intensity of Arabidopsis mRNAs spiked into the
labeling reactions (see Figs. 2-4) or to the signal intensity of
β-actin and glyceraldehyde-3-phosphate dehydrogenase
(GAPDH; see Fig. 5).
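
The normalization step above can be illustrated with a short sketch. It assumes, as a simplification not spelled out in the text, that one channel is rescaled by a single factor so that the control spots (spiked Arabidopsis mRNAs or housekeeping genes) have equal mean intensity in both channels. The function name, spot labels, and intensity values are hypothetical.

```python
def normalized_ratios(cy3, cy5, control_spots):
    """Return the Cy5/Cy3 ratio per non-control spot after rescaling Cy5.

    cy3 and cy5 map spot name -> raw fluorescence intensity.
    control_spots lists spots expected to be equal in both samples.
    """
    cy3_control = sum(cy3[s] for s in control_spots) / len(control_spots)
    cy5_control = sum(cy5[s] for s in control_spots) / len(control_spots)
    scale = cy3_control / cy5_control
    return {
        spot: (cy5[spot] * scale) / cy3[spot]
        for spot in cy3
        if spot not in control_spots
    }


# Hypothetical raw intensities; the Cy5 scan is twice as bright overall.
cy3 = {"ctrl1": 100.0, "ctrl2": 200.0, "TNF": 50.0, "IL8": 80.0}
cy5 = {"ctrl1": 200.0, "ctrl2": 400.0, "TNF": 500.0, "IL8": 160.0}

ratios = normalized_ratios(cy3, cy5, ["ctrl1", "ctrl2"])
```

After rescaling, the apparent twofold excess of IL8 in the raw Cy5 scan disappears, while TNF remains fivefold higher in the Cy5-labeled sample.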

RESULTS 

Ninety-Six-Gene Microarray Design. The actions of cytokines, 
growth factors, chemokines, transcription factors, MMPs, pros- 
taglandins, and leukotrienes are well recognized in inflammatory 
disease, particularly RA (11-14). Fig. 1 displays the selected genes 
for this study and also includes control cDNAs of housekeeping
genes such as β-actin and GAPDH and genes from Arabidopsis
for signal normalization and quantitation (row A, columns 1-12).

Defining Microarray Assay Conditions. Different lengths and
concentrations of target DNA were tested by arraying PCR-amplified
products ranging from 0.2 to 1.2 kb at concentrations
of 1 µg/µl or less. No significant difference in the signal levels was
observed within this range of target size and only with 0.2-kb
length was a signal reduced upon an 8-fold dilution of the 1 µg/µl
sample (data not shown). In this study the average length of the
targets was 1 kb, with a few exceptions in the range of ~300 bp,
arrayed at a concentration of 1 µg/µl. Normally one PCR provided
sufficient material to fabricate up to 1000 microarray targets.

Fig. 1. Ninety-six-element microarray design. The target element name and the corresponding gene are shown in the layout. Some genes have
more than one target element to guarantee specificity of signal. For TNF the targets represent decreasing lengths of 1, 0.8, 0.6, 0.4, and 0.2 kb from
left to right.

In considering positional effects in the development of the 
targets for the microarrays, selection was biased toward the 3' 
proximal regions, because the signal was reduced if the target 
fragment was biased toward the 5' end (data not shown). This 
result was anticipated since the hybridizing probe is prepared by 
reverse transcription with oligo(dT)-primed mRNA and is richer 
in 3' proximal sequences. Cross-hybridizations of probes to
targets of a gene family were analyzed with the matrix metalloproteinases
as the example because they can show regions of
sequence identities of greater than 70%. With collagenase-1
(Col-1) and collagenase-2 (Col-2) genes as targets with up to 70%
sequence identity, and stromelysin-1 (Strom-1) and stromelysin-2
(Strom-2) genes with different degrees of identity, our results
showed that a short region of overlap, even with 70-90% se- 
quence identity, produced a low level of cross-hybridization. 
However, shorter regions of identity spread over the length of the 
target resulted in cross-hybridization (data not shown). For 
closely related genes, targets were designed by avoiding long 
stretches of homology. For members of a gene family two or more 
target regions were included to discriminate between specificity 
of signal versus cross-hybridization. 

Monitoring Differential Expression in Cultured Cell Lines. In
RA tissue, the monocyte/macrophage population plays a prominent
role in phagocytic and immunomodulatory activities. Typically
these cells, when triggered by an immunogen, produce the
proinflammatory cytokines TNF and IL-1. We have used the
monocyte cell line MM6 and monitored changes in gene expression
upon activation with LPS endotoxin, a component of Gram-negative
bacterial membranes, and PMA, which augments the
action of LPS on TNF production (15). RNA was isolated at
different times after induction and used for cDNA probe preparation.
From this time course it was clear that TNF expression
was induced within 15 min of treatment, reached maximum levels
in 1 hr, remained high until 4 hr and subsequently declined (Fig.
2A). Many other cytokine genes were also transiently activated,
such as IL-1α and IL-6, and granulocyte colony-stimulating
factor (GCSF). Prominent chemokines activated were IL-8, macrophage
inflammatory protein (MIP)-1β, more so than MIP-1α,
and Groα or melanoma growth stimulatory factor. Migration
inhibitory factor (MIF) expressed in the uninduced state declined
in LPS-activated cells. Of the immediate early genes, the noticeable
ones were c-fos, fra-1, c-jun, NF-κB p50, and IκB, with c-rel
expression observed even in the uninduced state (Fig. 2B). These
expression patterns are consistent with reported patterns of
activation of certain LPS- and PMA-induced genes (12). Demonstrated
here is the unique ability of this system to allow parallel
visualization of a large number of gene activities over a period of
time.

Fig. 2. Time course for LPS/PMA-induced MM6 cells. Array elements are described in Fig. 1. (A) Pseudocolor representations of fluorescent
scans correspond to gene expression levels at each time point. The array is made up of 8 Arabidopsis control targets and 86 human cDNA targets,
the majority of which are genes with known or suspected involvement in inflammation. The color bars provide a comparative calibration scale
between arrays and are derived from the Arabidopsis mRNA samples that are introduced in equal amounts during probe preparation. Fluorescent
probes were made by labeling mRNA from untreated MM6 cells or LPS and PMA treated cells. mRNA was isolated at indicated times after
induction. (B I-III) The two-color samples were cohybridized, and microarray scans provided the data for the levels of select transcripts at different
time points relative to abundance at time zero. The analysis was performed using normalized data collected from 8-bit images.

The SW1353 cell line is derived from malignant tumors of the
cartilage and behaves much like the chondrocytes upon stimulation
with TNF and IL-1 in the expression of MMPs (9). In
addition to confirming our earlier observations with Northern
blots on Strom-1, Col-1, and Col-3 expression (9), gelatinase
(Gel) A, putative metalloproteinase (PUMP)-1, membrane-type
matrix metalloproteinase, tissue inhibitors of matrix
metalloproteinases or tissue inhibitor of metalloproteinase 1
(TIMP-1), -2, and -3 were also expressed by these cells together
with the human matrix metallo-elastase (HME; Fig. 3A). HME
induction was estimated to be ~50-fold and was greater than
any of the other MMPs examined (Fig. 3B). This result was
unexpected because HME is reportedly expressed only by
alveolar macrophage and placental cells (16). Expression of
the cytokines and chemokines, IL-6, IL-8, MIF, and MIP-1β
was also noted. A variety of other genes, including certain
transcription factors, were also up-regulated (Fig. 3), but the
overall time-dependent expression of genes in the SW1353
cells was qualitatively distinct from the MM6 cells.

Fig. 3. Time course for IL-1β and TNF-induced SW1353 cells
using the inflammation array (Fig. 1). (A) Pseudocolor representations
of fluorescent scans correspond to gene expression levels at each time
point. (B I-IV) Relative levels of selected genes at different time points
compared with time zero.

Quantitation of differential gene expression (Figs. 2B and
3B) was achieved with the simultaneous hybridization of
Cy3-labeled cDNA from untreated cells and Cy5-labeled
cDNA from treated samples. The estimated increases in
expression from these microarrays for a select number of genes
including IL-1β, IL-8, MIP-1β, TNF, HME, Col-1, Col-3,
Strom-1, and Strom-2 were compared with data collected from 
dot blot analysis. Results (not shown) were in close agreement 
and confirmed our earlier observations on the use of the 
microarray method for the quantitation of gene expression (3). 

Expression Profiles in Primary Chondrocytes and Synoviocytes
of Human RA Tissue. Given the sensitivity and the
specificity of this method, expression profiles of primary
synoviocytes and chondrocytes from diseased tissue were
examined. Without prior exposure to inducing agents, low level
expression of c-jun, GCSF, IL-3, TNF-β, MIF, and RANTES
(regulated upon activation, normal T cell expressed and secreted)
was seen as well as expression of MMPs, GelA,
Strom-1, Col-1, and the three TIMPs. In this case, Col-2
hybridization was considered to be nonspecific because the
second Col-2 target taken from the 3' end of the gene gave no
signal. Treatment with PMA and IL-1, more so than with TNF and
IL-1, produced a dramatic up-regulation in expression of
several genes in both of these primary cell types. These genes
are as follows: the cytokine IL-6; the chemokines IL-8 and
Gro-1α; the MMPs Strom-1, Col-1, Col-3, and HME; and
the adhesion molecule vascular cell adhesion molecule 1
(VCAM-1). The surprise again is HME expression in these
primary cells, for reasons discussed above. From these results,
the expression profiles of synoviocytes and the chondrocytes
appear very similar; the differences are more quantitative than
qualitative. Treatment of the primary chondrocytes with the
anabolic growth factor TGF-β had an interesting profile in that
it produced a remarkable down-regulation of genes expressed
in both the untreated and induced state (Fig. 4).

[Fig. 4 image. Panels: A, human synovial fibroblasts; B, human articular chondrocytes. Conditions shown: uninduced, PMA/IL-1β, TNF/IL-1β, TGF-β.]

Fig. 4. Expression profiles for early passage primary synoviocytes and
chondrocytes isolated from RA tissue, cultured in the presence of 10%
fetal calf serum and activated with PMA and IL-1β, or TNF and IL-1β,
or TGF-β for 18 hr. The color bars provide a comparative calibration scale
between arrays and are derived from the Arabidopsis mRNA samples that
are introduced in equal amounts during probe preparation.

Given the demonstrated effectiveness of this technology, a 
comparative analysis of two different inflammatory disease 
states was conducted with probes made from RA tissue and 
IBD samples. RA samples were from late stage rheumatoid 
synovial tissue, and IBD specimens were obtained from in- 
flamed lower intestinal mucosa of patients with Crohn disease. 
With both the 96-element known gene microarray and the 
1000-gene microarray of cDNAs selected from a peripheral 
human blood cell library (3), distinct differences in gene 
expression patterns were evident. On the 96-gene array, RA 
tissue samples from different affected individuals gave similar 
profiles (data not shown) as did different samples from the 
same individual (Fig. 5). These patterns were notably similar 
to those observed with primary synoviocytes and chondrocytes 
(Fig. 4). Included in the list of prominently up-regulated genes
are IL-6, the MMPs Strom-1, Col-1, GelA, HME, and in
certain samples PUMP, TIMPs, particularly TIMP-1 and
TIMP-3, and the adhesion molecule VCAM. Discernible levels
of macrophage chemotactic protein 1 (MCP-1), MIF, and
RANTES were also noted. IBD samples were, in comparison,
rather subdued, although IL-1 converting enzyme (ICE),
TIMP-1, and MIF were notable in all three IBD
samples examined here. In IBD-A, one of the three individual
samples, ICE, VCAM, Groα, and MMP expression was more
pronounced than in the others.

Fig. 5. Expression profiles of RA tissue (A) and IBD tissue (B).
mRNA from RA tissue samples obtained from the same individual was
isolated directly after excision (RA 21.5A) or maintained in culture
without serum for 2 hr (RA 21.5B) or for 6 hr (RA 21.5C). Profiles
from tissue samples of two other individuals (data not shown) were
remarkably similar to the ones shown here. IBD-A and IBD-CI are
from mRNA samples prepared directly after surgery from two sepa-
rate individuals. For the IBD-CII probe, the tissue sample was cultured
in medium without serum for 2 hr before mRNA preparation.

We also made use of a peripheral blood cDNA library (3) 
to identify genes expressed by lymphocytes infiltrating the 
inflamed tissues from the circulating blood. With the 1046- 
element array of randomly selected cDNAs from this library, 
probes made from RA and IBD samples showed hybridizations
to a large number of genes. Of these, many were common 
between the two disease tissues while others were differentially 
expressed (data not shown). A complete survey of these genes 
was beyond the scope of this study, but for this report we 
picked three genes that were up-regulated in the RA tissue 
relative to IBD. These cDNAs were sequenced and identified 
by comparison to the GenBank database. They are TIMP-1, 
apoferritin light chain, and manganese superoxide dismutase 
(MnSOD). Differential expression of MnSOD was only ob- 
served in samples of RA tissue explants maintained in growth 
medium without serum for between 2 and 16 hr. These
results also indicate that the expression profile of genes can be
altered when explants are transferred to culture conditions.

DISCUSSION 

The speed, ease, and feasibility of simultaneously monitoring 
differential expression of hundreds of genes with the cDNA 
microarray based system (1-3) is demonstrated here in the 
analysis of a complex disease such as RA. Many different cell
types in the RA tissue (macrophages, lymphocytes, plasma cells,
neutrophils, synoviocytes, chondrocytes, etc.) are known to con-
tribute to the development of the disease with the expression of
gene products known to be proinflammatory. They include the 
cytokines, chemokines, growth factors, MMPs, eicosanoids, and 
others (7, 11-14), and the design of the 96-element known gene 
microarray was based on this knowledge and depended on the 
availability of the genes. The technology was validated by con- 
firming earlier observations on the expression of TNF by the 
monocyte cell line MM6, and of Col-1 and Col-3 expression in the 
chondrosarcoma cells and articular chondrocytes (9, 12). In our 
time-dependent survey the chronological order of gene activities 
in and between gene families was compared and the results have 
provided unprecedented profiles of the cytokines (TNF, IL-1, 
IL-6, GCSF, and MIF), chemokines (MIP-1α, MIP-1β, IL-8, and
Gro-1), certain transcription factors, and the matrix metal- 
loproteinases (GelA, Strom-1, Col-1, Col-3, HME) in the mac- 
rophage cell line MM6 and in the SW1353 chondrosarcoma cells. 

Earlier reports of cytokine production in the diseased state had 
established a model in which TNF is a major participant in RA. 
Its expression reportedly preceded that of the other cytokines and 
effector molecules (4). Our results strongly support this model,
as demonstrated in the time course of the MM6 cells where TNF
induction preceded that of IL-1α and IL-1β followed by IL-6 and
GCSF. These expression profiles demonstrate the utility of the
microarrays in determining the hierarchy of signaling events.
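
As an illustration of how a hierarchy of this kind might be read out of time-course data, the following minimal sketch ranks genes by the first time point at which their induction crosses a threshold. The function names, the 2-fold threshold, and all numerical values are assumptions made for illustration; they are not taken from the study.

```python
def onset_times(time_course, threshold=2.0):
    """Return {gene: first time point (hr) at which fold induction >= threshold}."""
    onsets = {}
    for gene, series in time_course.items():
        for hours, fold in sorted(series.items()):
            if fold >= threshold:
                onsets[gene] = hours
                break  # keep only the earliest crossing
    return onsets


def induction_order(time_course, threshold=2.0):
    """List genes from earliest to latest onset (ties broken alphabetically)."""
    onsets = onset_times(time_course, threshold)
    return sorted(onsets, key=lambda gene: (onsets[gene], gene))


# Hypothetical fold-induction values (hours -> fold over uninduced),
# invented purely to exercise the functions above.
example = {
    "TNF": {0: 1.0, 1: 6.0, 4: 9.0, 8: 4.0},
    "IL-1a": {0: 1.0, 1: 1.2, 4: 5.0, 8: 7.0},
    "IL-6": {0: 1.0, 1: 1.0, 4: 1.5, 8: 6.0},
}

order = induction_order(example)
print(order)
```

A gene that never reaches the threshold is simply left out of the ordering, which is one of several reasonable conventions.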

In the SW1353 chondrosarcoma cells, all the known MMPs and 
TIMPs were examined simultaneously. HME expression was 
discovered, which previously had been observed in only the 
stromal cells and alveolar macrophages of smoker's lungs and in 
placental tissue. Its presence in cells of the RA tissue is mean- 
ingful because its activity can cause significant destruction of 
elastin and basement membrane components (16, 17). Expression 
profiles of synovial fibroblasts and articular chondrocytes were 
remarkably similar and not too different from the SW1353 cells, 
indicating that the fibroblast and the chondrocyte can play equally 
aggressive roles in joint erosion. Prominent genes expressed were
the MMPs, but chemokines and cytokines were also produced by
these cells. The effect of the anabolic growth factor TGF-β was
profoundly evident in demonstrating the down-regulation of these
catabolic activities.

RA tissue samples undeniably reflected profiles similar to 
the cell types examined. Active genes observed were IL-3, IL-6, 
ICE, the MMPs including HME and TIMPs, chemokines IL-8,
Groα, MIP, MIF, and RANTES, and the adhesion molecule
VCAM. Of the growth factors, fibroblast growth factor β was
observed most frequently. In comparison, the expression 
patterns in the other inflammatory state (i.e., IBD) were not 
as marked as in the RA samples, at least as obtained from the 
tissue samples selected for this study. 

As an alternative approach, the 1046 cDNA microarray of 
randomly selected genes from a lymphocyte library was used to 
identify genes expressed in RA tissue (3). Many genes on this 
array hybridized with probes made from both RA and IBD tissue
samples. The results are not surprising because inflammatory 
tissue is abundantly supplied with cell types infiltrating from the 
circulating blood, made apparent also by the high levels of 
chemokine expression in RA tissue. Because of the magnitude of 
the effort required to identify all the hybridized genes, we have for 
this report chosen to describe only three differentially expressed 
genes mainly to verify this method of analysis. 

Of the large number of genes observed here, a fair number 
were already known as active participants in inflammatory dis- 
ease. These are TNF, IL-1, IL-6, IL-8, GCSF, RANTES, and 
VCAM. The novel participants not previously reported are 
HME, IL-3, ICE, and Groα. With our discovery of HME
expression in RA, this gene becomes a target for drug interven-
tion. ICE is a cysteine protease well known for its IL-1β process-
ing activity (18), and recognized for its role in apoptotic cell death
(19). Its expression in RA tissue is intriguing. IL-3 is recognized 
for its growth-promoting activity in hematopoietic cell lineages, is 
a product of activated T cells (20), and its expression in synovio- 
cytes and chondrocytes of RA tissue is a novel observation. 

Like IL-8, Groα is a C-X-C subgroup chemokine and is a
potent neutrophil and basophil chemoattractant. It down-
regulates the expression of types I and III interstitial collagens
(21, 22) and is seen here produced by the MM6 cells, in primary
synoviocytes, and in RA tissue. With the presence of the C-C
chemokines RANTES, MCP, and MIP-1β (23), migration and
infiltration of monocytes, and particularly T cells, into the tissue
are also enhanced (5), aiding the trafficking and recruitment of
leukocytes into the RA tissue. Their activation, phagocytosis,
degranulation, and respiratory bursts could be responsible for 
the induction of MnSOD in RA. MnSOD is also induced by 
TNF and IL-1 and serves a protective function against oxida- 
tive damage. The induction of the ferritin light chain encoding 
gene in this tissue may be for reasons similar to those for 
MnSOD. Ferritin is the major intracellular iron storage protein 
and it is responsive to intracellular oxidative stress and reactive 
oxygen intermediates generated during inflammation (24, 25). 
The active expression of TIMP-1 in RA tissue, as detected by 
the 1000-element array, is no surprise because our results have 
repeatedly shown TIMP-1 to be expressed in the constitutive 
and induced states of RA cells and tissues. 

The suitability of the cDNA microarray technology for 
profiling diseases and for identifying disease related genes is 
well documented here. This technology could provide new
targets for drug development and disease therapies, and in
doing so allow for improved treatment of chronic diseases that 
are challenging because of their complexity.

AAGAATGATGTGATCACAAAGGTGCGG-3'). The
mutations were confirmed by sequence analysis. Af-
ter mutagenesis, the 0.4-kb Bam HI-Msc I fragment
from the mutagenized pC225 plasmids was trans-

Quantitative Monitoring of Gene Expression
Patterns with a Complementary DNA Microarray 

Mark Schena,* Dari Shalon,*† Ronald W. Davis,
Patrick O. Brown* 

A high-capacity system was developed to monitor the expression of many genes in
parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs on
glass were used for quantitative expression measurements of the corresponding genes.
Because of the small format and high density of the arrays, hybridization volumes of 2
microliters could be used that enabled detection of rare transcripts in probe mixtures
derived from 2 micrograms of total cellular messenger RNA. Differential expression 
measurements of 45 Arabidopsis genes were made by means of simultaneous two-color 
fluorescence hybridization. 



The temporal, developmental, topographi- 
cal, histological, and physiological patterns 
in which a gene is expressed provide clues to 
its biological role. The large and expanding 
database of complementary DNA (cDNA) 
sequences from many organisms (1) presents
the opportunity of defining these patterns at
the level of the whole genome.

For these studies, we used the small flow- 
ering plant Arabidopsis thaliana as a model 
organism. Arabidopsis possesses many ad-
vantages for gene expression analysis, in-
cluding the fact that it has the smallest
genome of any higher eukaryote examined 
to date (2). Forty-five cloned Arabidopsis 
cDNAs (Table 1), including 14 complete 
sequences and 31 expressed sequence tags 
(ESTs), were used as gene-specific targets. 
We obtained the ESTs by selecting cDNA 
clones at random from an Arabidopsis 
cDNA library. Sequence analysis revealed 
that 28 of the 31 ESTs matched sequences
in the database (Table 1). Three additional
cDNAs from other organisms served as con-
trols in the experiments.

The 48 cDNAs, averaging ~1.0 kb,
were amplified with the polymerase chain
reaction (PCR) and deposited into indi-
vidual wells of a 96-well microtiter plate.
Each sample was duplicated in two adja-
cent wells to allow the reproducibility of
the arraying and hybridization process to
be tested. Samples from the microtiter
plate were printed onto glass microscope
slides in an area measuring 3.5 mm by 5.5
mm with the use of a high-speed arraying
machine (3). The arrays were processed by
chemical and heat treatment to attach the 
DNA sequences to the glass surface and 
denature them (3). Three arrays, printed 
in a single lot, were used for the experi- 
ments here. A single microtiter plate of 
PCR products provides sufficient material 
to print at least 500 arrays. 
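
The duplicate-well arrangement described above can be sketched as follows. This is only an illustration: the row-by-row filling order, the sample names, and the function name are assumptions, and the source states only that each sample occupied two adjacent wells of a 96-well plate.

```python
from string import ascii_lowercase

ROWS = ascii_lowercase[:8]  # rows a-h of a 96-well plate
COLUMNS = 12                # columns 1-12


def duplicate_layout(samples):
    """Assign each sample to two horizontally adjacent wells, filling row by row.

    The filling order is an assumption made for this sketch.
    """
    pairs_per_row = COLUMNS // 2
    if len(samples) > len(ROWS) * pairs_per_row:
        raise ValueError("too many samples for one 96-well plate")
    layout = {}
    for index, name in enumerate(samples):
        row = ROWS[index // pairs_per_row]
        first_col = 2 * (index % pairs_per_row) + 1
        layout[name] = (f"{row}{first_col}", f"{row}{first_col + 1}")
    return layout


# Hypothetical sample names standing in for the 48 cDNAs.
samples = [f"cDNA{n:02d}" for n in range(1, 49)]
layout = duplicate_layout(samples)
print(layout["cDNA01"], layout["cDNA48"])
```

With 48 samples in duplicate, all 96 wells are occupied, which matches the plate capacity given in the text.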

Fluorescent probes were prepared from 
total Arabidopsis mRNA (4) by a single 
round of reverse transcription (5). The Ara- 
bidopsis mRNA was supplemented with hu- 
man acetylcholine receptor (AChR) mRNA 
at a dilution of 1:10,000 (w/w) before cDNA
synthesis, to provide an internal standard for 
calibration (5). The resulting fluorescently 
labeled cDNA mixture was hybridized to an 
array at high stringency (6) and scanned
with a laser (3). A high-sensitivity scan gave
signals that saturated the detector at nearly
all of the Arabidopsis target sites (Fig. 1A).
Calibration relative to the AChR mRNA
standard (Fig. 1A) established a sensitivity
limit of ~1:50,000. No detectable hybridiza-
tion was observed to either the rat glucocor-
ticoid receptor (Fig. 1A) or the yeast TRP4
(Fig. 1A) targets even at the highest scan-
ning sensitivity. A moderate-sensitivity scan
of the same array allowed linear detection of
the more abundant transcripts (Fig. 1B).
Quantitation of both scans revealed a range
of expression levels spanning three orders of
magnitude for the 45 genes tested (Table 2).
RNA blots (7) for several genes (Fig. 2)
corroborated the expression levels measured
with the microarray to within a factor of 5
(Table 2).

Differential gene expression was investi-
gated with a simultaneous, two-color hy-
bridization scheme, which served to mini-
mize experimental variation inherent in the
comparison of independent hybridizations.
Fluorescent probes were prepared from two
mRNA sources with the use of reverse tran-
scriptase in the presence of fluorescein- and
lissamine-labeled nucleotide analogs, re-
spectively (5). The two probes were then
mixed together in equal proportions, hy-
bridized to a single array, and scanned sep-
arately for fluorescein and lissamine emis-
sion after independent excitation of the two
fluorophores (3).

[Fig. 1 image not reproduced. Recoverable labels: panel A, high sensitivity; column positions 1 through 12.]

Fig. 1. Gene expression monitored with the use of cDNA microarrays. Fluorescent scans represented in
pseudocolor correspond to hybridization intensities. Color bars were calibrated from the signal obtained
with the use of known concentrations of human AChR mRNA. Numbers and
letters on the axes mark the position of each cDNA. (A) High-sensitivity fluorescein scan after hybridization
with fluorescein-labeled cDNA derived from wild-type plants. (B) Same array as in (A) but scanned at
moderate sensitivity. (C and D) A single array was probed with a 1:1 mixture of fluorescein-labeled cDNA
from wild-type plants and lissamine-labeled cDNA from HAT4-transgenic plants. The single array was
then scanned successively to detect the fluorescein fluorescence corresponding to mRNA from wild-type
plants (C) and the lissamine fluorescence corresponding to mRNA from HAT4-transgenic plants (D). (E
and F) A single array was probed with a 1:1 mixture of fluorescein-labeled cDNA from root tissue and
lissamine-labeled cDNA from leaf tissue. The single array was then scanned successively to detect the
fluorescein fluorescence corresponding to mRNAs expressed in roots (E) and the lissamine fluorescence
corresponding to mRNAs expressed in leaves (F).
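
The calibration against an internal standard described in the text can be illustrated with a minimal sketch. It assumes that signal scales linearly with transcript abundance, which the text reports only for scans below detector saturation; the function names and the intensity values are hypothetical.

```python
def calibrated_abundance(gene_signal, spike_signal, spike_dilution):
    """Estimate transcript abundance (w/w) by linear scaling against a spike-in.

    spike_dilution is the mass fraction at which the standard mRNA was added,
    for example 1 / 10_000. Linearity of the detector is assumed.
    """
    if spike_signal <= 0:
        raise ValueError("spike-in signal must be positive")
    return (gene_signal / spike_signal) * spike_dilution


def as_ratio(abundance):
    """Format an abundance as a '1:N' dilution string."""
    return f"1:{round(1 / abundance):,}"


# Hypothetical intensities: a target giving half the signal of a standard
# spiked in at 1:10,000 would be estimated at 1:20,000.
estimate = calibrated_abundance(500.0, 1000.0, 1 / 10_000)
print(as_ratio(estimate))
```

Because each fluorophore is scanned separately, the same scaling would be applied per channel with the standard's signal measured in that channel.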

To test whether overexpression of a sin- 
gle gene could be detected in a pool of total 
Arabidopsis mRNA, we used a microarray to
analyze a transgenic line overexpressing the 
single transcription factor HAT4 (8). Fluo- 
rescent probes representing mRNA from 
wild-type and HAT4-transgenic plants were 
labeled with fluorescein and lissamine, re- 
spectively; the two probes were then mixed 
and hybridized to a single array. An intense 
hybridization signal was observed at the 
position of the HAT4 cDNA in the lissa-
mine-specific scan (Fig. 1D), but not in the
fluorescein-specific scan of the same array 
(Fig. 1C). Calibration with AChR mRNA 
added to the fluorescein and lissamine 
cDNA synthesis reactions at dilutions of 
1:10,000 (Fig. 1C) and 1:100 (Fig. 1D),
respectively, revealed a 50-fold elevation of 
HAT4 mRNA in the transgenic line rela- 
tive to its abundance in wild-type plants 
(Table 2). This magnitude of HAT4 over- 
expression matched that inferred from the 
Northern (RNA) analysis within a factor of 
2 (Fig. 2 and Table 2). Expression of all the 
other genes monitored on the array differed 
by less than a factor of 5 between HAT4-
transgenic and wild-type plants (Fig. 1, C
and D, and Table 2). Hybridization of flu-
orescein-labeled glucocorticoid receptor
cDNA (Fig. 1C) and lissamine-labeled
TRP4 cDNA (Fig. 1D) verified the pres-
ence of the negative control targets and the
lack of optical cross talk between the two
fluorophores.

[Fig. 2 image not reproduced. Column groups: wild type; HAT4 transgenic. Rows: CAB1, HAT4, ROC1, human AChR. mRNA amounts (ng): 20, 2.0, 0.2.]

Fig. 2. Gene expression monitored with RNA
(Northern) blot analysis. Designated amounts of
mRNA from wild-type and HAT4-transgenic
plants were spotted onto nylon membranes and
probed with the cDNAs indicated. Purified human
AChR mRNA was used for calibration.
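
The fold elevation quoted for HAT4 can be checked against the microarray values listed in Table 2 (HAT4 at 1:8300 in wild type and 1:150 in the transgenic line; CAB1 at 1:48 and 1:120). The short sketch below does only that arithmetic; the helper names are illustrative and not part of the original work.

```python
from fractions import Fraction


def parse_level(text):
    """Convert a '1:N' expression level (w/w) into an exact fraction."""
    numerator, denominator = text.split(":")
    return Fraction(int(numerator), int(denominator))


def fold_change(treated, reference):
    """Ratio of two expression levels given as '1:N' strings."""
    return float(parse_level(treated) / parse_level(reference))


# Microarray values from Table 2: transgenic relative to wild type.
hat4_fold = fold_change("1:150", "1:8300")
cab1_fold = fold_change("1:120", "1:48")
print(f"HAT4: {hat4_fold:.1f}-fold, CAB1: {cab1_fold:.1f}-fold")
```

The HAT4 ratio comes out at roughly 55, in line with the approximately 50-fold elevation stated in the text, while CAB1 changes by a factor of 2.5, within the factor-of-5 bound reported for the other genes.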

To explore a more complex alteration in 
expression patterns, we performed a second 
two-color hybridization experiment with 
fluorescein- and lissamine- labeled probes 
prepared from root and leaf mRNA, respec- 
tively. The scanning sensitivities for the 
two fluorophores were normalized by 
matching the signals resulting from AChR
mRNA, which was added to both cDNA
synthesis reactions at a dilution of 1:1000
(Fig. 1, E and F). A comparison of the scans
revealed widespread differences in gene ex-
pression between root and leaf tissue (Fig. 1,
E and F). The mRNA from the light-regu-
lated CAB1 gene was ~500-fold more abun-
dant in leaf (Fig. 1F) than in root tissue
(Fig. 1E). The expression of 26 other genes
differed between root and leaf tissue by
more than a factor of 5 (Fig. 1, E and F).
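
A comparison of this kind, in which the two scans are first equalized on the internal standard and genes are then flagged when they differ by more than a chosen factor, might be sketched as below. The function name and all intensity values are invented for illustration, and every signal is assumed to be positive.

```python
def differential_genes(channel_a, channel_b, spike="AChR", factor=5.0):
    """Return {gene: ratio} for genes whose normalized signals differ by > factor.

    The two channels are equalized by matching the spike-in signal, and the
    ratio is reported as channel_b relative to channel_a. All signals are
    assumed to be positive.
    """
    scale = channel_a[spike] / channel_b[spike]  # brings channel_b onto channel_a's scale
    flagged = {}
    for gene in channel_a:
        if gene == spike:
            continue
        ratio = (channel_b[gene] * scale) / channel_a[gene]
        if ratio > factor or ratio < 1 / factor:
            flagged[gene] = ratio
    return flagged


# Hypothetical intensities for a root (fluorescein) and leaf (lissamine) scan.
root = {"AChR": 100.0, "CAB1": 2.0, "ROC1": 50.0, "EST_x": 40.0}
leaf = {"AChR": 200.0, "CAB1": 2000.0, "ROC1": 120.0, "EST_x": 10.0}

flagged = differential_genes(root, leaf)
print(flagged)
```

Genes are flagged in either direction, so a transcript enriched in root is reported with a ratio below 1/5 rather than being missed.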

The HAT4-transgenic line we examined
has elongated hypocotyls, early flowering,
poor germination, and altered pigmentation
(8). Although changes in expression were
observed for HAT4, large changes in ex-
pression were not observed for any of the
other 44 genes we examined. This was
somewhat surprising, particularly because
comparative analysis of leaf and root tissue
identified 27 differentially expressed genes.
Analysis of an expanded set of genes may be
required to identify genes whose expression
changes upon HAT4 overexpression; alter-
natively, a comparison of mRNA popula-
tions from specific tissues of wild-type and
HAT4-transgenic plants may allow identi-
fication of downstream genes.

Table 1. Sequences contained on the microarray. Shown is the position, the known or putative
function, and the accession number of each cDNA. All but three of the ESTs used
in this study matched a sequence in the database. NADH, reduced nicotinamide adenine
dinucleotide; ATPase, adenosine triphosphatase.

cDNA names legible in this copy: EST19, GBF-1, EST23, EST29, rGR, HAT1, HAT2, HAT5,
EST52, EST59, KNAT1, EST60, EST69, PPH1, EST70, EST75, EST78, ROC1, EST82, EST83,
EST84, SAR1, EST100. Array positions legible in this copy: c7,8; d11,12; f5,6; h5,6.

Function                          Accession number

Human AChR                        *
Actin                             H36236
NADH dehydrogenase                Z27010
Actin 1                           M20016
Unknown                           U36594†
Actin                             T45783
Chlorophyll a/b binding           M85150
Phosphoglycerate kinase           T44490
Gibberellic acid biosynthesis     L37126
Unknown                           U36595†
G-box binding factor 1            X63894
Elongation factor                 X52256
Aldolase                          T04477
G-box binding factor 2            X63895
Chloroplast protease              R87034
Unknown                           T14152
Catalase                          T22720
Rat glucocorticoid receptor       M14053
Unknown                           U36596†
ATPase                            J04185
Homeobox-leucine zipper 1         U09332
Light harvesting complex          T04063
Unknown                           T76267
Homeobox-leucine zipper 2         U09335
Homeobox-leucine zipper 4         M90394
Phosphoribulokinase               T04344
Homeobox-leucine zipper 5         M90416
Unknown                           Z33675
Homeobox-leucine zipper 22        U09336
Oxygen evolving                   T21749
Unknown                           Z34607
Knotted-like homeobox 1           U14174
RuBisCO small subunit             X14564
Translation elongation factor     T42799
Protein phosphatase 1             U34803
Unknown                           T44621
Chloroplast protease              T43698
Unknown                           R65481
Cyclophilin                       L14844
GTP binding                       X59152
Unknown                           Z33795
Unknown                           T45278
Unknown                           T13832
Unknown                           R64816
Synaptobrevin                     M90418
Light harvesting complex          Z18205
Light harvesting complex          X03909
Yeast tryptophan biosynthesis     X04273

At the current density of robotic printing, 
it is feasible to scale up the fabrication pro- 
cess to produce arrays containing 20,000 
cDNA targets. At this density, a single array 
would be sufficient to provide gene-specific 
targets encompassing nearly the entire rep-
ertoire of expressed genes in the Arabidopsis
genome (2). The availability of 20,274 ESTs
from Arabidopsis (1, 9) would provide a rich
source of templates for such studies.

The estimated 100,000 genes in the hu- 
man genome (10) exceeds the number of 
Arabidopsis genes by a factor of 5 (2). This 
modest increase in complexity suggests that 
similar cDNA microarrays, prepared from 
the rapidly growing repertoire of human 
ESTs (1), could be used to determine the
expression patterns of tens of thousands of 
human genes in diverse cell types. Coupling 
an amplification strategy to the reverse 
transcription reaction (11) could make it
feasible to monitor expression even in 
minute tissue samples. A wide variety of 
acute and chronic physiological and patho- 
logical conditions might lead to character- 
istic changes in the patterns of gene expres- 
sion in peripheral blood cells or other easily 
sampled tissues. In concert with cDNA mi- 
croarrays for monitoring complex expres- 
sion patterns, these tissues might therefore 
serve as sensitive in vivo sensors for clinical 
diagnosis. Microarrays of cDNAs could thus 
provide a useful link between human gene 
sequences and clinical medicine. 



*Proprietary sequence of Stratagene (La Jolla, California). †No match in the database; novel EST.

Table 2. Gene expression monitoring by microar-
ray and RNA blot analyses; tg, HAT4-transgenic.
See Table 1 for additional gene information. Ex-
pression levels (w/w) were calibrated with the use
of known amounts of human AChR mRNA. Values
for the microarray were determined from microar-
ray scans (Fig. 1); values for the RNA blot were
determined from RNA blots (Fig. 2).

                 Expression level (w/w)
Gene             Microarray     RNA blot

CAB1             1:48           1:83
CAB1 (tg)        1:120          1:150
HAT4             1:8300         1:6300
HAT4 (tg)        1:150          1510
ROC1             1:1200         1:1800
ROC1 (tg)        1560           1:1300

Gene Therapy in Peripheral Blood
Lymphocytes and Bone Marrow for 
ADA Immunodeficient Patients 

Claudio Bordignon,* Luigi D. Notarangelo, Nadia Nobili,
Giuliana Ferrari, Giulia Casorati, Paola Panina, Evelina Mazzolari, 
Daniela Maggioni, Claudia Rossi, Paolo Servida, 
Alberto G. Ugazio, Fulvio Mavilio 

Adenosine deaminase (ADA) deficiency results in severe combined immunodeficiency, 
the first genetic disorder treated by gene therapy. Two different retroviral vectors were 
used to transfer ex vivo the human ADA minigene into bone marrow cells and peripheral 
blood lymphocytes from two patients undergoing exogenous enzyme replacement ther- 
apy. After 2 years of treatment, long-term survival of T and B lymphocytes, marrow cells, 
and granulocytes expressing the transferred ADA gene was demonstrated and resulted 
in normalization of the immune repertoire and restoration of cellular and humoral immunity. 
After discontinuation of treatment, T lymphocytes, derived from transduced peripheral 
blood lymphocytes, were progressively replaced by marrow-derived T cells in both pa- 
tients. These results indicate successful gene transfer into long-lasting progenitor cells, 
producing a functional multilineage progeny. 



Severe combined immunodeficiency asso- 
ciated with inherited deficiency of ADA 
(1) is usually fatal unless affected children
are kept in protective isolation or the im-
mune system is reconstituted by bone mar-
row transplantation from a human leuko-
cyte antigen (HLA)-identical sibling donor
(2). This is the therapy of choice, although
it is available only for a minority of patients.
In recent years, other forms of therapy have
been developed, including transplants from
haploidentical donors (3, 4), exogenous en-
zyme replacement (5), and somatic-cell
gene therapy (6-9).

We previously reported a preclinical mod-
el in which ADA gene transfer and expression
successfully restored immune functions in hu-
man ADA-deficient (ADA⁻) peripheral
blood lymphocytes (PBLs) in immunodefi-
cient mice in vivo (10, 11). On the basis of
these preclinical results, the clinical applica-
tion of gene therapy for the treatment of
ADA⁻ SCID (severe combined immunodefi-
ciency disease) patients who previously failed
exogenous enzyme replacement therapy was
approved by our Institutional Ethical Com-
mittees and by the Italian National Commit-
tee for Bioethics (12). In addition to evaluat-
ing the safety and efficacy of the gene therapy
procedure, the aim of the study was to define
the relative role of PBLs and hematopoietic
stem cells in the long-term reconstitution of 
immune functions after retroviral vector-me- 
diated ADA gene transfer. For this purpose, 
two structurally identical vectors expressing 
the human ADA complementary DNA 
(cDNA), distinguishable by the presence of 
alternative restriction sites in a nonfunctional 
region of the viral long-terminal repeat 
(LTR), were used to transduce PBLs and bone 
marrow (BM) cells independently. This pro- 
cedure allowed identification of the origin of 
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